Computerized intelligent assistant for conferences

ABSTRACT

A method for facilitating a remote conference includes receiving a digital video and a computer-readable audio signal. A face recognition machine is operated to recognize a face of a first conference participant in the digital video, and a speech recognition machine is operated to translate the computer-readable audio signal into a first text. An attribution machine attributes the text to the first conference participant. A second computer-readable audio signal is processed similarly, to obtain a second text attributed to a second conference participant. A transcription machine automatically creates a transcript including the first text attributed to the first conference participant and the second text attributed to the second conference participant.

CROSS REFERENCE TO RELATED APPLICATIONS

This application is a continuation of U.S. application Ser. No. 17/115,293, filed Dec. 8, 2020, titled “COMPUTERIZED INTELLIGENT ASSISTANT FOR CONFERENCES”, which itself is a continuation of U.S. application Ser. No. 16/024,503, filed Jun. 29, 2018, titled “Computerized Intelligent Assistant for Conferences,” which claims the benefit of priority to U.S. Provisional Patent Application Ser. No. 62/667,368, filed May 4, 2018, the entirety of which is hereby incorporated herein by reference for all purposes.

BACKGROUND

Individuals and organizations frequently arrange conferences in which a plurality of local and/or remote users participate to share information and to plan and report on tasks and commitments. Such conferences may include sharing information across multiple different modalities, e.g., including spoken and textual conversation, shared visual images, shared digital files, gestures, and non-verbal cues.

SUMMARY

This Summary is provided to introduce a selection of concepts in a simplified form that are further described below in the Detailed Description. This Summary is not intended to identify key features or essential features of the claimed subject matter, nor is it intended to be used to limit the scope of the claimed subject matter. Furthermore, the claimed subject matter is not limited to implementations that solve any or all disadvantages noted in any part of this disclosure.

A method for facilitating a remote conference includes receiving a digital video and a computer-readable audio signal. A face recognition machine is operated to recognize a face of a first conference participant in the digital video, and a speech recognition machine is operated to translate the computer-readable audio signal into a first text. An attribution machine attributes the text to the first conference participant. A second computer-readable audio signal is processed similarly, to obtain a second text attributed to a second conference participant. A transcription machine automatically creates a transcript including the first text attributed to the first conference participant and the second text attributed to the second conference participant. Transcription can be extended to a variety of scenarios to coordinate the conference, facilitate communication among conference participants, record events of interest during the conference, track whiteboard drawings and digital files shared during the conference, and more generally create a robust record of multi-modal interactions among conference participants. The conference transcript can be used by participants for reviewing various multi-modal interactions and other events of interest that happened in the conference. The conference transcript can be analyzed to provide conference participants with feedback regarding their own participation in the conference, other participants, and team/organizational trends.

BRIEF DESCRIPTION OF THE DRAWINGS

FIGS. 1A-1C depict a computing environment including an exemplary computerized conference assistant.

FIG. 2 schematically shows analysis of sound signals by a sound source localization machine.

FIG. 3 schematically shows beamforming of sound signals by a beamforming machine.

FIG. 4 schematically shows detection of human faces by a face detection machine.

FIG. 5 schematically shows identification of human faces by a face identification machine.

FIG. 6 schematically shows an exemplary diarization framework.

FIG. 7 is a visual representation of an example output of a diarization machine.

FIG. 8 schematically shows recognition of an utterance by a speech recognition machine.

FIG. 9 shows an example of diarization by a computerized conference assistant.

FIG. 10 shows an example conference transcript.

FIG. 11 schematically shows an exemplary diarization framework in which speech recognition machines are downstream from a diarization machine.

FIG. 12 schematically shows an exemplary diarization framework in which speech recognition machines are upstream from a diarization machine.

FIG. 13 shows an exemplary conference environment in which a computerized intelligent assistant coordinates a conference in assistance of a plurality of conference participants.

FIG. 14 shows a method of facilitating a conference with a computerized intelligent assistant.

FIGS. 15-19 show example use cases in which a computerized intelligent assistant informs users about conference events of interest.

FIG. 20 shows an example use case in which a computerized intelligent assistant provides feedback to a conference participant.

FIG. 21 shows an example use case in which a computerized intelligent assistant helps a user to find a conference room.

FIG. 22 shows an exemplary computing system.

DETAILED DESCRIPTION

The present disclosure relates generally to providing intelligent assistance to conference participants using a computerized intelligent assistant. The conference participants may include in-person participants who are physically present at a conference location, as well as remote participants who participate via remote audio, video, textual, and/or multi-modal interaction with the in-person participants. In some examples, natural language inputs, such as conversation among conference participants, user commands, and other utterances, may be received and processed by the computerized intelligent assistant. Natural language inputs may include speech audio, lexical data (e.g., text), and/or non-verbal cues including hand gestures. In some examples, natural language inputs may be processed as commands by the computerized intelligent assistant, e.g., in order to control recording and/or mediate conversation between local and/or remote participants. As an example, a “cut” hand gesture can be used to stop recording; and a “raise hand” gesture may be used to send a remote participant a notification that a local participant is asking permission to speak. In some examples, data from one or more sensors also may be utilized to process the natural language inputs. In some examples, the computerized intelligent assistant may engage in conversation with conference participants, e.g., to ask disambiguating questions, provide confirmation of a received/processed input, and/or to provide description or directions relating to coordinating the conference. The computerized intelligent assistant may process the natural language data to generate identity, location/position, status/activity, and/or other information related to the conference (e.g., information shared by one or more of the conference participants during the conference, and/or information related to one or more of the conference participants).

The conference assistant may coordinate the start and/or end of the conference based on a conference schedule and based on tracking participant arrivals and/or departures. For example, the conference assistant may greet conference participants, inform them as to the conference schedule and/or agenda, etc. The conference assistant can record and/or transcribe various multi-modal interactions between conference participants. For example, the conference assistant can keep track of images shared at a whiteboard and can process the images to show relevant changes to the images while removing occlusion and visual artifacts. The conference assistant can keep track of digital files being shared by the conference participants, including tracking which regions of the files are being edited at particular moments of the conference. More generally, the conference assistant can track events of interest in the conference based on cues such as hand gestures, based on participants' names being mentioned, based on discussion of a topic of interest to one or more participants, or based on artificial intelligence analysis of any of the other various multi-modal interactions between conference participants that are tracked by the conference assistant. The various events of interest can be used as an index in the conference transcript, so that conference participants can readily find relevant portions of the transcript. Accordingly, the conference assistant facilitates reviewing the conference, e.g., after the conference is over, or by a remote conference participant who is unable to attend the conference physically, or by a non-participant who is unable to participate in the conference in real-time.

Furthermore, the conference transcript and other tracked information may be automatically analyzed in order to coordinate the conference, by providing a transcript of the conference to conference participants for subsequent review, tracking arrivals and departures of conference participants, providing cues to conference participants during the conference, and/or analyzing the information in order to summarize one or more aspects of the conference for subsequent review.

FIG. 1A shows an example conference environment 100 including three conference participants 102A, 102B, and 102C meeting around a table 104. A computerized conference assistant 106 is on table 104 ready to facilitate a meeting between the conference participants. Computerized conference assistants consistent with this disclosure may be configured with a myriad of features designed to facilitate productive meetings. While the following description uses computerized conference assistant 106 as an example computer, other computers or combinations of computers may be configured to utilize the techniques described below. As such, the present disclosure is in no way limited to computerized conference assistant 106.

FIG. 1B schematically shows relevant aspects of computerized conference assistant 106, each of which is discussed below. Of particular relevance, computerized conference assistant 106 includes microphone(s) 108 and camera(s) 110.

As shown in FIG. 1A, the computerized conference assistant 106 includes an array of seven microphones 108a, 108b, 108c, 108d, 108e, 108f, and 108g. As shown in FIG. 1C, these microphones 108 are configured to directionally record sound and convert the audible sound into a computer-readable audio signal 112 (i.e., signals 112a, 112b, 112c, 112d, 112e, 112f, and 112g respectively). “Computer-readable signal” and “computer-readable audio signal” may refer herein to any audio signal which is suitable for further processing by one or more computing devices, e.g., analog and/or digital electrical signals. Accordingly, an analog to digital converter and optional digital encoders may be used to convert the sound into the computer-readable audio signals. In some examples, a computer-readable signal may be divided into a plurality of portions for subsequent processing, e.g., by selecting a particular duration and/or temporal portion of the signal, by selecting particular channels of the signal (e.g., left or right channel, or a particular microphone of a microphone array), by selecting particular frequencies of the signal (e.g., low-pass, high-pass, or band-pass filter), and/or by selecting particular spatial components of the signal, such as by beamforming the signal as described herein. Microphones 108a-f are equally spaced around the computerized conference assistant 106 and aimed to directionally record sound originating in front of the microphone. Microphone 108g is positioned between the other microphones and aimed upward.

In some implementations, computerized conference assistant 106 includes a 360° camera configured to convert light of one or more electromagnetic bands (e.g., visible, infrared, and/or near infrared) into a 360° digital video 114 or other suitable visible, infrared, near infrared, spectral, and/or depth digital video. In some implementations, the 360° camera may include fisheye optics that redirect light from all azimuthal angles around the computerized conference assistant 106 to a single matrix of light sensors, and logic for mapping the independent measurements from the sensors to a corresponding matrix of pixels in the 360° digital video 114. In some implementations, two or more cooperating cameras may take overlapping sub-images that are stitched together into digital video 114. In some implementations, camera(s) 110 have a collective field of view of less than 360° and/or two or more originating perspectives (e.g., cameras pointing toward a center of the room from the four corners of the room). 360° digital video 114 is shown as being substantially rectangular without appreciable geometric distortion, although this is in no way required.

Returning briefly to FIG. 1B, computerized conference assistant 106 includes a sound source localization (SSL) machine 120 that is configured to estimate the location(s) of sound(s) based on signals 112. FIG. 2 schematically shows SSL machine 120 analyzing signals 112a-g to output an estimated origination 140 of the sound modeled by signals 112a-g. As introduced above, signals 112a-g are respectively generated by microphones 108a-g. Each microphone has a different physical position and/or is aimed in a different direction. Microphones that are farther from a sound source and/or aimed away from a sound source will generate a relatively lower amplitude and/or slightly phase delayed signal 112 relative to microphones that are closer to and/or aimed toward the sound source. As an example, while microphones 108a and 108d may respectively produce signals 112a and 112d in response to the same sound, signal 112a may have a measurably greater amplitude if the recorded sound originated in front of microphone 108a. Similarly, signal 112d may be phase shifted behind signal 112a due to the longer time of flight (ToF) of the sound to microphone 108d. SSL machine 120 may use the amplitude, phase difference, and/or other parameters of the signals 112a-g to estimate the origination 140 of a sound. SSL machine 120 may be configured to implement any suitable two- or three-dimensional location algorithms, including but not limited to previously-trained artificial neural networks, maximum likelihood algorithms, multiple signal classification algorithms, and cross-power spectrum phase analysis algorithms. Depending on the algorithm(s) used in a particular application, the SSL machine 120 may output an angle, vector, coordinate, and/or other parameter estimating the origination 140 of a sound.
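As a purely illustrative sketch of how a phase delay between two microphone signals could be turned into an angle, the following Python example estimates the delay by plain time-domain cross-correlation and converts it to a far-field bearing. This is one simple stand-in for the "any suitable" localization algorithms mentioned above, not the disclosed method; the function names, the two-microphone geometry, the 0.1 m spacing, the 16 kHz sample rate, and the speed-of-sound constant are all assumptions made for the example.

```python
import math

SPEED_OF_SOUND_M_S = 343.0  # assumed speed of sound in room-temperature air


def best_lag(reference, delayed, max_lag):
    """Return the lag (in samples) at which `delayed` best matches `reference`.

    A positive result means `delayed` arrives later than `reference`.
    """
    best, best_score = 0, float("-inf")
    for lag in range(-max_lag, max_lag + 1):
        score = 0.0
        for i, value in enumerate(reference):
            j = i + lag
            if 0 <= j < len(delayed):
                score += value * delayed[j]
        if score > best_score:
            best, best_score = lag, score
    return best


def arrival_angle_deg(lag_samples, sample_rate_hz, mic_spacing_m):
    """Convert a time difference of arrival into a far-field bearing.

    0 degrees is broadside (sound reaches both microphones together).
    """
    path_difference_m = lag_samples / sample_rate_hz * SPEED_OF_SOUND_M_S
    # Clamp so measurement noise cannot push the ratio outside asin's domain.
    ratio = max(-1.0, min(1.0, path_difference_m / mic_spacing_m))
    return math.degrees(math.asin(ratio))


# Synthetic signals: a short pulse that reaches the second microphone 3 samples late.
pulse = [0.0, 0.0, 1.0, 0.5, -0.5, 0.0, 0.0, 0.0, 0.0, 0.0]
late = [0.0, 0.0, 0.0] + pulse[:-3]
lag = best_lag(pulse, late, max_lag=5)
angle = arrival_angle_deg(lag, sample_rate_hz=16000, mic_spacing_m=0.1)
```

With seven microphones, pairwise estimates of this kind could be combined, or replaced entirely by one of the other algorithm families listed above.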

As shown in FIG. 1B, computerized conference assistant 106 also includes a beamforming machine 122. The beamforming machine 122 may be configured to isolate sounds originating in a particular zone (e.g., a 0-60° arc) from sounds originating in other zones. In the embodiment depicted in FIG. 3, beamforming machine 122 is configured to isolate sounds in any of six equally-sized static zones. In other implementations, there may be more or fewer static zones, dynamically sized zones (e.g., a focused 15° arc), and/or dynamically aimed zones (e.g., a 60° zone centered at 9°). Any suitable beamforming signal processing may be utilized to subtract sounds originating outside of a selected zone from a resulting beamformed signal 150. In implementations that utilize dynamic beamforming, the location of the various speakers may be used as criteria for selecting the number, size, and centering of the various beamforming zones. As one example, the number of zones may be selected to equal the number of speakers, and each zone may be centered on the location of the speaker (e.g., as determined via face identification and/or sound source localization). In some implementations, beamforming machine 122 may be configured to independently and simultaneously listen to two or more different zones, and output two or more different beamformed signals in parallel. As such, two or more overlapping/interrupting speakers may be independently processed.
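The zone bookkeeping described above can be illustrated with a minimal Python sketch. It covers only the selection of static and dynamically aimed zones, not the beamforming signal processing itself, and every function name and default value here is a hypothetical choice made for illustration.

```python
def static_zone_index(angle_deg, zone_count=6):
    """Index of the equally sized static zone that contains angle_deg."""
    width = 360.0 / zone_count
    return int((angle_deg % 360.0) // width)


def dynamic_zones(speaker_angles_deg, width_deg=60.0):
    """One (start, end) arc per speaker, centered on that speaker's angle."""
    half = width_deg / 2.0
    return [((a - half) % 360.0, (a + half) % 360.0) for a in speaker_angles_deg]


def in_zone(angle_deg, zone):
    """True if angle_deg lies inside the arc, handling wrap-around at 360 degrees."""
    start, end = zone
    angle = angle_deg % 360.0
    if start <= end:
        return start <= angle <= end
    # The arc crosses 0 degrees, e.g. (339, 39).
    return angle >= start or angle <= end


# Static zones for the three face angles used as examples in this description.
static = [static_zone_index(a) for a in (23, 178, 303)]
# A dynamically aimed 60-degree zone centered at 9 degrees.
aimed = dynamic_zones([9.0])[0]
```

Under these assumptions, one dynamic zone per detected speaker would let overlapping speakers be routed to separate beamformed signals.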

As shown in FIG. 1B, computerized conference assistant 106 includes a face location machine 124 and a face identification machine 126. As shown in FIG. 4, face location machine 124 is configured to find candidate faces 166 in digital video 114. As an example, FIG. 4 shows face location machine 124 finding candidate FACE(1) at 23°, candidate FACE(2) at 178°, and candidate FACE(3) at 303°. The candidate faces 166 output by the face location machine 124 may include coordinates of a bounding box around a located face image, a portion of the digital image where the face was located, other location information (e.g., 23°), and/or labels (e.g., “FACE(1)”).

Face identification machine 126 optionally may be configured to determine an identity 168 of each candidate face 166 by analyzing just the portions of the digital video 114 where candidate faces 166 have been found. In other implementations, the face location step may be omitted, and the face identification machine may analyze a larger portion of the digital video 114 to identify faces. FIG. 5 shows an example in which face identification machine 126 identifies candidate FACE(1) as “Bob,” candidate FACE(2) as “Charlie,” and candidate FACE(3) as “Alice.” While not shown, each identity 168 may have an associated confidence value, and two or more different identities 168 having different confidence values may be found for the same face (e.g., Bob(88%), Bert(33%)). If an identity with at least a threshold confidence cannot be found, the face may remain unidentified and/or may be given a generic unique identity 168 (e.g., “Guest(42)”). Speech may be attributed to such generic unique identities.
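The threshold behavior described above might be sketched as follows. The class name, the 0.5 threshold, and the sequential guest numbering are assumptions for illustration; the disclosure does not specify how generic identities are generated.

```python
class IdentityResolver:
    """Pick the most confident identity, or fall back to a generic guest label."""

    def __init__(self, threshold=0.5):
        self.threshold = threshold  # hypothetical minimum confidence
        self._next_guest = 1        # hypothetical guest numbering scheme

    def resolve(self, candidates):
        """candidates maps an identity name to a confidence in [0, 1]."""
        if candidates:
            name, confidence = max(candidates.items(), key=lambda item: item[1])
            if confidence >= self.threshold:
                return name
        # No candidate is confident enough: issue a generic unique identity.
        guest = f"Guest({self._next_guest})"
        self._next_guest += 1
        return guest


resolver = IdentityResolver(threshold=0.5)
known = resolver.resolve({"Bob": 0.88, "Bert": 0.33})
unknown = resolver.resolve({"Bert": 0.33})
```

Because the guest label is unique per unidentified face, later speech can still be attributed consistently to that label.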

When used, face location machine 124 may employ any suitable combination of state-of-the-art and/or future machine learning (ML) and/or artificial intelligence (AI) techniques. Non-limiting examples of techniques that may be incorporated in an implementation of face location machine 124 include support vector machines, multi-layer neural networks, convolutional neural networks (e.g., including spatial convolutional networks for processing images and/or videos), recurrent neural networks (e.g., long short-term memory networks), associative memories (e.g., lookup tables, hash tables, Bloom filters, Neural Turing Machine and/or Neural Random Access Memory), unsupervised spatial and/or clustering methods (e.g., nearest neighbor algorithms, topological data analysis, and/or k-means clustering) and/or graphical models (e.g., Markov models, conditional random fields, and/or AI knowledge bases).

In some examples, the methods and processes utilized by face location machine 124 may be implemented using one or more differentiable functions, wherein a gradient of the differentiable functions may be calculated and/or estimated with regard to inputs and/or outputs of the differentiable functions (e.g., with regard to training data, and/or with regard to an objective function). Such methods and processes may be at least partially determined by a set of trainable parameters. Accordingly, the trainable parameters may be adjusted through any suitable training procedure, in order to continually improve functioning of the face location machine 124.

Non-limiting examples of training procedures for face location machine 124 include supervised training (e.g., using gradient descent or any other suitable optimization method), zero-shot, few-shot, unsupervised learning methods (e.g., classification based on classes derived from unsupervised clustering methods), reinforcement learning (e.g., deep Q learning based on feedback) and/or generative adversarial neural network training methods. In some examples, a plurality of components of face location machine 124 may be trained simultaneously with regard to an objective function measuring performance of collective functioning of the plurality of components (e.g., with regard to reinforcement feedback and/or with regard to labelled training data), in order to improve such collective functioning. In some examples, one or more components of face location machine 124 may be trained independently of other components (e.g., offline training on historical data). For example, face location machine 124 may be trained via supervised training on labelled training data comprising images with labels indicating any face(s) present within such images, and with regard to an objective function measuring an accuracy, precision, and/or recall of locating faces by face location machine 124 as compared to actual locations of faces indicated in the labelled training data.

In some examples, face location machine 124 may employ a convolutional neural network configured to convolve inputs with one or more predefined, randomized and/or learned convolutional kernels. By convolving the convolutional kernels with an input vector (e.g., representing digital video 114), the convolutional neural network may detect a feature associated with the convolutional kernel. For example, a convolutional kernel may be convolved with an input image to detect low-level visual features such as lines, edges, corners, etc., based on various convolution operations with a plurality of different convolutional kernels. Convolved outputs of the various convolution operations may be processed by a pooling layer (e.g., max pooling) which may detect one or more most salient features of the input image and/or aggregate salient features of the input image, in order to detect salient features of the input image at particular locations in the input image. Pooled outputs of the pooling layer may be further processed by further convolutional layers. Convolutional kernels of further convolutional layers may recognize higher-level visual features, e.g., shapes and patterns, and more generally spatial arrangements of lower-level visual features. Some layers of the convolutional neural network may accordingly recognize and/or locate visual features of faces (e.g., noses, eyes, lips). Accordingly, the convolutional neural network may recognize and locate faces in the input image. Although the foregoing example is described with regard to a convolutional neural network, other neural network techniques may be able to detect and/or locate faces and other salient features based on detecting low-level visual features, higher-level visual features, and spatial arrangements of visual features.
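To make the convolution and pooling steps concrete, the following is a toy Python sketch of a single hand-written kernel and a max pooling layer applied to a tiny image. It is not a face detector: the 5x5 image, the vertical-edge kernel, and the function names are assumptions chosen so the arithmetic can be followed by hand, and a trained network would learn its kernels rather than use a fixed one.

```python
def convolve2d(image, kernel):
    """Valid-mode 2D cross-correlation, the 'convolution' used in CNN layers."""
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(image[r + i][c + j] * kernel[i][j] for i in range(kh) for j in range(kw))
            for c in range(out_w)
        ]
        for r in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Non-overlapping max pooling; rows/columns that do not fill a window are dropped."""
    rows = len(feature_map) // size
    cols = len(feature_map[0]) // size
    return [
        [
            max(feature_map[r * size + i][c * size + j] for i in range(size) for j in range(size))
            for c in range(cols)
        ]
        for r in range(rows)
    ]


# A 5x5 image whose left part is dark (0) and right part is bright (1).
image = [[0, 0, 0, 1, 1] for _ in range(5)]
# Hand-written kernel that responds to dark-to-bright vertical edges.
vertical_edge = [[-1, 1], [-1, 1]]

edges = convolve2d(image, vertical_edge)  # 4x4 response map
pooled = max_pool(edges, size=2)          # 2x2 summary of the strongest responses
```

The response map is nonzero only where the kernel straddles the edge, and pooling keeps that response while coarsening its location, which is the behavior the paragraph above describes for low-level features.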

Face identification machine 126 may employ any suitable combination of state-of-the-art and/or future ML and/or AI techniques. Non-limiting examples of techniques that may be incorporated in an implementation of face identification machine 126 include support vector machines, multi-layer neural networks, convolutional neural networks, recurrent neural networks, associative memories, unsupervised spatial and/or clustering methods, and/or graphical models.

In some examples, face identification machine 126 may be implemented using one or more differentiable functions and at least partially determined by a set of trainable parameters. Accordingly, the trainable parameters may be adjusted through any suitable training procedure, in order to continually improve functioning of the face identification machine 126.

Non-limiting examples of training procedures for face identification machine 126 include supervised training, zero-shot, few-shot, unsupervised learning methods, reinforcement learning and/or generative adversarial neural network training methods. In some examples, a plurality of components of face identification machine 126 may be trained simultaneously with regard to an objective function measuring performance of collective functioning of the plurality of components in order to improve such collective functioning. In some examples, one or more components of face identification machine 126 may be trained independently of other components.

In some examples, face identification machine 126 may employ a convolutional neural network configured to detect and/or locate salient features of input images. In some examples, face identification machine 126 may be trained via supervised training on labelled training data comprising images with labels indicating a specific identity of any face(s) present within such images, and with regard to an objective function measuring an accuracy, precision, and/or recall of identifying faces by face identification machine 126 as compared to actual identities of faces indicated in the labelled training data. In some examples, face identification machine 126 may be trained via supervised training on labelled training data comprising pairs of face images with labels indicating whether the two face images in a pair are images of a single individual or images of two different individuals, and with regard to an objective function measuring an accuracy, precision, and/or recall of distinguishing single-individual pairs from two-different-individual pairs.
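For the pair-based training described above, the accuracy, precision, and recall of "same individual" decisions could be measured along the following lines. This is only a sketch of the evaluation arithmetic under the standard definitions of those metrics; the function name and the boolean encoding of pair labels are assumptions, and the disclosure does not say how the objective function is actually constructed from them.

```python
def pair_metrics(predicted_same, actually_same):
    """Score 'same individual' decisions for a list of face-image pairs.

    Both arguments are equal-length lists of booleans, where True means the
    two images in the pair show a single individual.
    """
    if len(predicted_same) != len(actually_same):
        raise ValueError("inputs must have the same length")
    pairs = list(zip(predicted_same, actually_same))
    tp = sum(1 for p, a in pairs if p and a)          # correctly matched pairs
    fp = sum(1 for p, a in pairs if p and not a)      # different people judged the same
    fn = sum(1 for p, a in pairs if not p and a)      # same person judged different
    tn = sum(1 for p, a in pairs if not p and not a)  # correctly separated pairs
    return {
        "accuracy": (tp + tn) / len(pairs) if pairs else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
    }


metrics = pair_metrics(
    predicted_same=[True, True, True, False, False],
    actually_same=[True, True, False, True, False],
)
```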

In some examples, face identification machine 126 may be configured to classify faces by selecting and/or outputting a confidence value for an identity from a predefined selection of identities, e.g., a predefined selection of identities for whom face images were available in training data used to train face identification machine 126. In some examples, face identification machine 126 may be configured to assess a feature vector representing a face, e.g., based on an output of a hidden layer of a neural network employed in face identification machine 126. Feature vectors assessed by face identification machine 126 for a face image may represent an embedding of the face image in a representation space learned by face identification machine 126. Accordingly, feature vectors may represent salient features of faces based on such embedding in the representation space.

In some examples, face identification machine 126 may be configured to enroll one or more individuals for later identification. Enrollment by face identification machine 126 may include assessing a feature vector representing the individual's face, e.g., based on one or more images and/or video of the individual's face. In some examples, identification of an individual based on a test image may be based on a comparison of a test feature vector assessed by face identification machine 126 for the test image, to a previously-assessed feature vector from when the individual was enrolled for later identification. Comparing a test feature vector to a feature vector from enrollment may be performed in any suitable fashion, e.g., using a measure of similarity such as cosine or inner product similarity, and/or by unsupervised spatial and/or clustering methods (e.g., approximate k-nearest neighbor methods). Comparing the test feature vector to the feature vector from enrollment may be suitable for assessing identity of individuals represented by the two vectors, e.g., based on comparing salient features of faces represented by the vectors.
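A comparison of a test feature vector against enrolled feature vectors using cosine similarity might look like the following minimal sketch. The three-dimensional vectors, the 0.8 acceptance threshold, and the function names are invented for illustration; real embeddings would come from a trained network and typically have many more dimensions.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length feature vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def identify(test_vector, enrolled, threshold=0.8):
    """Return (name, similarity) for the closest enrolled vector.

    The name is None when even the best match falls below the threshold.
    """
    best_name, best_score = None, -1.0
    for name, vector in enrolled.items():
        score = cosine_similarity(test_vector, vector)
        if score > best_score:
            best_name, best_score = name, score
    if best_score < threshold:
        return None, best_score
    return best_name, best_score


# Hypothetical enrollment vectors (illustrative values only).
enrolled = {"Alice": [0.9, 0.1, 0.0], "Bob": [0.1, 0.8, 0.3]}
match, score = identify([0.2, 0.9, 0.25], enrolled)
no_match, low_score = identify([0.0, 0.0, 1.0], enrolled)
```

A linear scan is adequate for a handful of enrolled participants; for large enrollments, an approximate nearest neighbor index of the kind mentioned above could replace it.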

As shown in FIG. 1B, computerized conference assistant 106 includes a voice identification machine 128. The voice identification machine 128 is analogous to the face identification machine 126 because it also attempts to identify an individual. However, unlike the face identification machine 126, which is trained on and operates on video images, the voice identification machine is trained on and operates on audio signals, such as beamformed signal 150 and/or signal(s) 112. The ML and AI techniques described above may be used by voice identification machine 128. The voice identification machine outputs voice IDs 170, optionally with corresponding confidences (e.g., Bob(77%)).

FIG. 6 schematically shows an example diarization framework 600 for the above-discussed components of computerized conference assistant 106. While diarization framework 600 is described below with reference to computerized conference assistant 106, the diarization framework may be implemented using different hardware, firmware, and/or software components (e.g., different microphone and/or camera placements and/or configurations). Furthermore, SSL machine 120, beamforming machine 122, face location machine 124, and/or face identification machine 126 may be used in different sensor fusion frameworks designed to associate speech utterances with the correct speaker.

In the illustrated implementation, microphones 108 provide signals 112 to SSL machine 120 and beamforming machine 122, and the SSL machine outputs origination 140 to diarization machine 132. In some implementations, origination 140 optionally may be output to beamforming machine 122. Camera 110 provides 360° digital videos 114 to face location machine 124 and face identification machine 126. The face location machine passes the locations of candidate faces 166 (e.g., 23°) to the beamforming machine 122, which the beamforming machine may utilize to select a desired zone where a speaker has been identified. The beamforming machine 122 passes beamformed signal 150 to diarization machine 132 and to voice identification machine 128, which passes voice ID 170 to the diarization machine 132. Face identification machine 126 outputs identities 168 (e.g., “Bob”) with corresponding locations of candidate faces (e.g., 23°) to the diarization machine. While not shown, the diarization machine may receive other information and use such information to attribute speech utterances to the correct speaker.

Diarization machine 132 is a sensor fusion machine configured to use the various received signals to associate recorded speech with the appropriate speaker. The diarization machine is configured to attribute information encoded in the beamformed signal or another audio signal to the human responsible for generating the corresponding sounds/speech. In some implementations (e.g., FIG. 11), the diarization machine is configured to attribute the actual audio signal to the corresponding speaker (e.g., label the audio signal with the speaker identity). In some implementations (e.g., FIG. 12), the diarization machine is configured to attribute speech-recognized text to the corresponding speaker (e.g., label the text with the speaker identity).

In one nonlimiting example, the following algorithm may be employed:

- Video input (e.g., 360° digital video 114) from start to time t is denoted as $V_{1:t}$.
- Audio input from N microphones (e.g., signals 112) is denoted as $A_{1:t}^{[1:N]}$.
- Diarization machine 132 solves WHO is speaking, at WHERE and WHEN, by maximizing the following:

$\max\limits_{{who},{angle}}{P\left( {who},{angle} \mid A_{1:t}^{[1:N]},V_{1:t} \right)}$

where $P\left( {who},{angle} \mid A_{1:t}^{[1:N]},V_{1:t} \right)$ is computed by

$P\left( {who} \mid A_{1:t}^{[1:N]},{angle} \right) \times P\left( {angle} \mid A_{1:t}^{[1:N]} \right) \times P\left( {who},{angle} \mid V_{1:t} \right)$

and where:

- $P\left( {who} \mid A_{1:t}^{[1:N]},{angle} \right)$ is the voice ID 170, which takes N channel inputs and selects one beamformed signal 150 according to the angle of candidate face 166;
- $P\left( {angle} \mid A_{1:t}^{[1:N]} \right)$ is the origination 140, which takes N channel inputs and predicts which angle most likely has sound; and
- $P\left( {who},{angle} \mid V_{1:t} \right)$ is the identity 168, which takes the video 114 as input and predicts the probability of each face showing up at each angle.
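The maximization over the product of the three terms can be sketched in Python as follows. The dictionaries stand in for the outputs of the voice identification, sound source localization, and face identification machines; their layout, the function name, and every probability value below are hypothetical and chosen only to show the arithmetic.

```python
def diarize(voice_id, origination, face_identity):
    """Pick the (who, angle) pair with the highest product of the three terms.

    voice_id[(who, angle)]      approximates P(who | audio, angle)
    origination[angle]          approximates P(angle | audio)
    face_identity[(who, angle)] approximates P(who, angle | video)
    """
    best_pair, best_score = None, 0.0
    for (who, angle), p_face in face_identity.items():
        # Missing entries are treated as zero probability.
        score = voice_id.get((who, angle), 0.0) * origination.get(angle, 0.0) * p_face
        if score > best_score:
            best_pair, best_score = (who, angle), score
    return best_pair, best_score


# Illustrative values only; angles are in degrees.
face_identity = {("Bob", 23): 0.99, ("Charlie", 178): 0.98, ("Alice", 303): 0.97}
origination = {23: 0.7, 178: 0.2, 303: 0.1}
voice_id = {("Bob", 23): 0.77, ("Charlie", 178): 0.4, ("Alice", 303): 0.3}

speaker, score = diarize(voice_id, origination, face_identity)
```

In this sketch the search only visits (who, angle) pairs proposed by the video term, which keeps the candidate set small when each face is strongly associated with a single angle.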

The above framework may be adapted to use any suitable processing strategies, including but not limited to the ML/AI techniques discussed above. Using the above framework, the probability of one face at the found angle is usually dominant, e.g., the probability of Bob's face at 23° is 99%, and the probabilities of his face at all the other angles are almost 0%.

FIG. 7 is a visual representation of an example output of diarization machine 132. In FIG. 7, a vertical axis is used to denote WHO (e.g., Bob) is speaking; the horizontal axis denotes WHEN (e.g., 30.01 s-34.87 s) that speaker is speaking; and the depth axis denotes from WHERE (e.g., 23°) that speaker is speaking. Diarization machine 132 may use this WHO/WHEN/WHERE information to label corresponding segments 604 of the audio signal(s) 606 under analysis with labels 608. The segments 604 and/or corresponding labels may be output from the diarization machine 132 in any suitable format. The output effectively associates speech with a particular speaker during a conversation among N speakers, and allows the audio signal corresponding to each speech utterance (with WHO/WHEN/WHERE labeling/metadata) to be used for myriad downstream operations. One nonlimiting downstream operation is conversation transcription, as discussed in more detail below. As another example, accurately attributing speech utterances to the correct speaker can be used by an AI assistant to identify who is talking, thus decreasing a necessity for speakers to address an AI assistant with a keyword (e.g., “Cortana”).
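Since the segments and labels may be output "in any suitable format," one hypothetical format is a small record per segment carrying the WHO/WHEN/WHERE metadata. The class name, field names, and text rendering below are assumptions made for illustration and are not taken from the figures.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LabeledSegment:
    who: str          # WHO: speaker identity, e.g. "Bob"
    start_s: float    # WHEN: segment start, in seconds
    end_s: float      # WHEN: segment end, in seconds
    angle_deg: float  # WHERE: bearing of the speaker, in degrees


def format_segments(segments):
    """Render segments in chronological order, one line of text per segment."""
    ordered = sorted(segments, key=lambda s: s.start_s)
    return [
        f"[{s.start_s:.2f}s-{s.end_s:.2f}s] {s.who} @ {s.angle_deg:g}°"
        for s in ordered
    ]


segments = [
    LabeledSegment("Charlie", 35.10, 37.40, 178),
    LabeledSegment("Bob", 30.01, 34.87, 23),
]
lines = format_segments(segments)
```

A downstream transcription step could attach recognized text to each record without needing to revisit the audio attribution.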

Returning briefly to FIG. 1B, computerized conference assistant 106 may include a speech recognition machine 130. As shown in FIG. 8, the speech recognition machine 130 may be configured to translate an audio signal of recorded speech (e.g., signals 112, beamformed signal 150, signal 606, and/or segments 604) into text 800. In the scenario illustrated in FIG. 8, speech recognition machine 130 translates signal 802 into the text: “Shall we play a game?”

Speech recognition machine 130 may employ any suitable combination of state-of-the-art and/or future natural language processing (NLP), AI, and/or ML techniques. Non-limiting examples of techniques that may be incorporated in an implementation of speech recognition machine 130 include support vector machines, multi-layer neural networks, convolutional neural networks (e.g., including temporal convolutional neural networks for processing natural language sentences), word embedding models (e.g., GloVe or Word2Vec), recurrent neural networks, associative memories, unsupervised spatial and/or clustering methods, graphical models, and/or natural language processing techniques (e.g., tokenization, stemming, constituency and/or dependency parsing, and/or intent recognition).

In some examples, speech recognition machine 130 may be implemented using one or more differentiable functions and at least partially determined by a set of trainable parameters. Accordingly, the trainable parameters may be adjusted through any suitable training procedure, in order to continually improve functioning of the speech recognition machine 130.

Non-limiting examples of training procedures for speech recognition machine 130 include supervised training, zero-shot, few-shot, unsupervised learning methods, reinforcement learning and/or generative adversarial neural network training methods. In some examples, a plurality of components of speech recognition machine 130 may be trained simultaneously with regard to an objective function measuring performance of collective functioning of the plurality of components in order to improve such collective functioning. In some examples, one or more components of speech recognition machine 130 may be trained independently of other components. In an example, speech recognition machine 130 may be trained via supervised training on labeled training data comprising speech audio annotated to indicate actual lexical data (e.g., words, phrases, and/or any other language data in textual form) corresponding to the speech audio, with regard to an objective function measuring an accuracy, precision, and/or recall of correctly recognizing lexical data corresponding to speech audio.

In some examples, speech recognition machine 130 may use an AI and/or ML model (e.g., an LSTM and/or a temporal convolutional neural network) to represent speech audio in a computer-readable format. In some examples, speech recognition machine 130 may represent speech audio input as word embedding vectors in a learned representation space shared by a speech audio model and a word embedding model (e.g., a latent representation space for GloVe vectors, and/or a latent representation space for Word2Vec vectors). Accordingly, by representing speech audio inputs and words in the learned representation space, speech recognition machine 130 may compare vectors representing speech audio to vectors representing words, to assess, for a speech audio input, a closest word embedding vector (e.g., based on cosine similarity and/or approximate k-nearest neighbor methods or any other suitable comparison method).
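The comparison step can be illustrated with a small sketch that scores a speech-audio vector against candidate word vectors by cosine similarity and returns the best match. The three-dimensional toy vectors and the function names are assumptions for illustration; in practice the vectors would come from the learned representation space, and an exhaustive scan like this one would likely be replaced by an approximate nearest-neighbor index.

```python
import math
from typing import Dict, List


def cosine_similarity(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat a zero vector as dissimilar to everything
    return dot / (norm_a * norm_b)


def closest_word(audio_vector: List[float], word_vectors: Dict[str, List[float]]) -> str:
    """Return the word whose embedding is most similar to the audio vector."""
    return max(
        word_vectors,
        key=lambda word: cosine_similarity(audio_vector, word_vectors[word]),
    )
```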

In some examples, speech recognition machine 130 may be configured to segment speech audio into words (e.g., using an LSTM trained to recognize word boundaries, and/or separating words based on silences or amplitude differences between adjacent words). In some examples, speech recognition machine 130 may classify individual words to assess lexical data for each individual word (e.g., character sequences, word sequences, n-grams). In some examples, speech recognition machine 130 may employ dependency and/or constituency parsing to derive a parse tree for lexical data. In some examples, speech recognition machine 130 may operate AI and/or ML models (e.g., LSTM) to translate speech audio and/or vectors representing speech audio in the learned representation space, into lexical data, wherein translating a word in the sequence is based on the speech audio at a current time and further based on an internal state of the AI and/or ML models representing previous words from previous times in the sequence. Translating a word from speech audio to lexical data in this fashion may capture relationships between words that are potentially informative for speech recognition, e.g., recognizing a potentially ambiguous word based on a context of previous words, and/or recognizing a mispronounced word based on a context of previous words. Accordingly, speech recognition machine 130 may be able to robustly recognize speech, even when such speech may include ambiguities, mispronunciations, etc.
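The silence-based option for separating words can be sketched with a simple amplitude threshold. This is only one hypothetical realization: the threshold, the minimum silence length, and the frame-level amplitude input are all assumed parameters, and a trained boundary model could be used instead.

```python
from typing import List, Tuple


def split_on_silence(
    amplitudes: List[float],
    threshold: float,
    min_silence_frames: int,
) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges of loud runs; end is exclusive.

    A run ends once at least `min_silence_frames` consecutive frames fall
    below `threshold`. Shorter quiet gaps stay inside the run.
    """
    segments: List[Tuple[int, int]] = []
    start = None
    silent_run = 0
    for i, amplitude in enumerate(amplitudes):
        if abs(amplitude) >= threshold:
            if start is None:
                start = i
            silent_run = 0
        elif start is not None:
            silent_run += 1
            if silent_run >= min_silence_frames:
                segments.append((start, i - silent_run + 1))
                start = None
                silent_run = 0
    if start is not None:
        # Close the final run, trimming any trailing quiet frames.
        segments.append((start, len(amplitudes) - silent_run))
    return segments
```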

Speech recognition machine 130 may be trained with regard to an individual, a plurality of individuals, and/or a population. Training speech recognition machine 130 with regard to a population of individuals may cause speech recognition machine 130 to robustly recognize speech by members of the population, taking into account possible distinct characteristics of speech that may occur more frequently within the population (e.g., different languages of speech, speaking accents, vocabulary, and/or any other distinctive characteristics of speech that may vary between members of populations). Training speech recognition machine 130 with regard to an individual and/or with regard to a plurality of individuals may further tune recognition of speech to take into account further differences in speech characteristics of the individual and/or plurality of individuals. In some examples, different speech recognition machines (e.g., a speech recognition machine (A) and a speech recognition machine (B)) may be trained with regard to different populations of individuals, thereby causing each different speech recognition machine to robustly recognize speech by members of different populations, taking into account speech characteristics that may differ between the different populations.

Labeled and/or partially labeled audio segments may be used to not only determine which of a plurality of N speakers is responsible for an utterance, but also translate the utterance into a textual representation for downstream operations, such as transcription. FIG. 9 shows a nonlimiting example in which the computerized conference assistant 106 uses microphones 108 and camera 110 to determine that a particular stream of sounds is a speech utterance from Bob, who is sitting at 23° around the table 104 and saying: “Shall we play a game?” The identities and positions of Charlie and Alice are also resolved, so that speech utterances from those speakers may be similarly attributed and translated into text.

FIG. 10 shows an example conference transcript 181, which includes text attributed, in chronological order, to the correct speakers. Transcriptions optionally may include other information, like the times of each speech utterance and/or the position of the speaker of each utterance. In scenarios in which speakers of different languages are participating in a conference, the text may be translated into a different language. For example, each reader of the transcript may be presented with a version of the transcript with all text in that reader's preferred language, even if one or more of the speakers originally spoke in different languages. Transcripts generated according to this disclosure may be updated in real time, such that new text can be added to the transcript with the proper speaker attribution responsive to each new utterance.

FIG. 11 shows a nonlimiting framework 1100 in which speech recognition machines 130 a-n are downstream from diarization machine 132. Each speech recognition machine 130 optionally may be tuned for a particular individual speaker (e.g., Bob) or species of speakers (e.g., Chinese language speaker, or English speaker with Chinese accent). In some embodiments, a user profile may specify a speech recognition machine (or parameters thereof) suited for the particular user, and that speech recognition machine (or parameters) may be used when the user is identified (e.g., via face recognition). In this way, a speech recognition machine tuned with a specific grammar and/or acoustic model may be selected for a particular speaker. Furthermore, because the speech from each different speaker may be processed independently of the speech of all other speakers, the grammar and/or acoustic model of all speakers may be dynamically updated in parallel on the fly. In the embodiment illustrated in FIG. 11, each speech recognition machine may receive segments 604 and labels 608 for a corresponding speaker, and each speech recognition machine may be configured to output text 800 with labels 608 for downstream operations, such as transcription.
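A minimal sketch of the profile-driven selection described for framework 1100 follows. It assumes each recognizer is a callable from audio to text, that a user profile stores a recognizer key, and that an untuned "default" recognizer exists as a fallback; all of these names and the string stand-ins for audio are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

# Stand-in type: a recognizer maps an audio segment (here a str) to text.
Recognizer = Callable[[str], str]


def select_recognizer(
    speaker: str,
    profiles: Dict[str, Dict[str, str]],
    recognizers: Dict[str, Recognizer],
    default_key: str = "default",
) -> Recognizer:
    """Pick the recognizer named in the speaker's profile, else the default."""
    key = profiles.get(speaker, {}).get("recognizer", default_key)
    return recognizers.get(key, recognizers[default_key])


def transcribe_labeled_segments(
    segments: List[Tuple[str, str]],
    profiles: Dict[str, Dict[str, str]],
    recognizers: Dict[str, Recognizer],
) -> List[Tuple[str, str]]:
    """Route each (speaker, audio) segment to that speaker's recognizer."""
    results = []
    for speaker, audio in segments:
        recognize = select_recognizer(speaker, profiles, recognizers)
        results.append((speaker, recognize(audio)))
    return results
```

Because each segment is handled by its own recognizer call, the per-speaker work in this sketch has no shared state and could be dispatched in parallel.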

FIG. 12 shows a nonlimiting framework 1200 in which speech recognition machines 130 a-n are upstream from diarization machine 132. In such a framework, diarization machine 132 may initially apply labels 608 to text 800 in addition to or instead of segments 604. Furthermore, the diarization machine may consider natural language attributes of text 800 as additional input signals when resolving which speaker is responsible for each utterance.

FIG. 13 shows an exemplary conference environment 100 in which a computerized intelligent assistant 1300 coordinates a conference being held by a plurality of local and remote participants. Although the following examples depict a conference including four local participants (Anna, Beatrice, Carol, and Dan) and one remote participant (Roger), the systems and methods of the present disclosure may be used to facilitate any conference including at least one local participant and any suitable number of remote participants. Computerized intelligent assistant 1300 may incorporate a diarization machine and/or a diarization framework configured for recognizing speakers and transcribing events during a conference (e.g., diarization framework 600, diarization framework 1100, diarization framework 1200, and/or diarization machine 132). In some embodiments, computerized intelligent assistant 1300 may take the form of computerized conference assistant 106.

“Conference environment” is used herein to refer to any area in relative proximity to computerized intelligent assistant 1300, wherein computerized intelligent assistant 1300 is able to collect at least some audiovisual and/or other relevant data in order to observe conference participants within the conference environment (e.g., a conference room, office, or any other suitable location for holding a meeting).

“Conference participant” is used herein to refer to any user of computerized intelligent assistant 1300 and/or other computer devices communicatively coupled to computerized intelligent assistant 1300 when such user is involved in a conference in any capacity. For example, in addition to local users who physically attend the conference and remote users who participate in the conference remotely, “conference participants” is used herein to refer to conference organizers who participate in the planning and/or scheduling of the conference, even when such conference organizers do not physically or remotely participate in the conference. Similarly, “conference participants” is used herein to refer to prospective participants of a conference (e.g., users who are invited to the conference), even when such prospective participants do not actually attend the conference. Similarly, “conference participants” is used herein to refer to individuals who are mentioned during a conference (e.g., an individual from a same organization as another conference participant), even when such individuals do not directly participate in the conference.

Computerized intelligent assistant 1300 includes a microphone, a camera, and a speaker. Computing system 1300 of FIG. 22 provides an example platform for implementing computerized intelligent assistant 1300. Computerized intelligent assistant 1300 is configured to capture audiovisual information arising anywhere in and/or nearby the conference environment 100. Computerized intelligent assistant 1300 may be configured to spatially locate such captured audiovisual information, for example, by use of a fisheye camera, depth camera, microphone array (e.g., one or more microphones), or any other suitable sensor device(s). When computerized intelligent assistant 1300 includes one or more microphones (e.g., within a microphone array), the microphones may include direction-sensitive, position-sensitive, and/or direction- or position-insensitive microphones. Computerized intelligent assistant 1300 may be configured to recognize an identity of a conference participant based on such captured audiovisual information, e.g., by recognizing a face appearance and/or voice audio previously associated with the conference participant.

Returning to FIG. 13, computerized intelligent assistant 1300 is communicatively coupled, via a network 1310, to a backend server 1320. For example, the backend server 1320 may be a conference server computer device generally configured to facilitate scheduling a conference, communication among companion devices being used by conference participants, and/or any other suitable tasks to facilitate the conference in cooperation with computerized intelligent assistant 1300. Network 1310 may be any suitable computer network, e.g., the Internet. Any method or process described herein may be implemented directly on computerized intelligent assistant 1300 (e.g., via logic and/or storage devices of computerized intelligent assistant 1300). Alternately or additionally, such methods and processes may be at least partially enacted by backend server 1320. Backend server 1320 may comprise any suitable computing device, e.g., a server device in a single enclosure, or a cluster of computing devices. “Computerized intelligent assistant” may be used herein to refer to a single device (e.g., computerized intelligent assistant 1300 shown on the table in conference environment 100) or to a collection of devices implementing methods and processes described herein, e.g., computerized intelligent assistant 1300 in combination with backend server 1320.

User devices of remote and/or local participants (e.g., remote and/or local user devices), as well as other computing devices associated with a conference environment (e.g., a display monitor in the conferencing environment) may be referred to herein more generally as companion devices. Although the following description includes examples of displayed content (e.g., notifications, transcripts, and results of analysis) at a remote user device 172, such displayed content may be displayed at any companion device. Companion devices may include any suitable devices, e.g., mobile phones, personal computers, tablet devices, etc. In some examples, companion devices may be communicatively coupled to computerized intelligent assistant 1300. In some examples, communicative coupling may be via network 1310. In some examples, communication between companion devices and computerized intelligent assistant 1300 may be mediated by backend server 1320 (e.g., remote user device 172 may be communicatively coupled to backend server 1320 which in turn may facilitate a bidirectional flow of information between remote user device 172 and computerized intelligent assistant 1300). Alternately or additionally, companion devices may communicatively couple to computerized intelligent assistant 1300 directly via a wired and/or wireless connection, e.g., via Bluetooth®.

Coordinating a conference including local and/or remote users may require computer-recognizing and tracking various data regarding the conference, before the conference begins and throughout the conference, in order to analyze such data and provide results of such analysis to conference participants in the form of notification messages, transcripts, feedback, etc. FIG. 14 shows a method 200 that may be performed by computerized intelligent assistant 1300 to facilitate a conference.

At 201, method 200 includes preparing for the conference (e.g., in advance of a start time of the conference). Accordingly, in advance of the conference, computerized intelligent assistant 1300 may receive information pertaining to the conference, e.g., location, schedule, and expected attendance. At 202, preparing for the conference includes determining a conference time and location. Determining the conference time and location may be based on receiving scheduling information from backend server 1320 or from any other computing devices (e.g., from a companion device of a conference participant, or based on a previous conversation with computerized intelligent assistant 1300 or with another, different computerized intelligent assistant, wherein such conversation includes a first conference participant asking computerized intelligent assistant 1300 to schedule a conference). Such scheduling information may be determined in advance by conference participants in any suitable manner, e.g., by adding an entry to a calendar program, or by sending an invitation to other conference participants via email, chat, or any other suitable notification messaging system. In some examples, a conference schedule and location may be determined in advance for one or more recurring conferences (e.g., a weekly meeting, a biweekly meeting, or a recurring meeting according to any other suitable schedule). In some examples, a conference schedule and location may be determined in a substantially ad-hoc manner shortly before the conference is scheduled to begin, e.g., by sending an invitation for a conference to be held immediately, or by a first conference participant asking computerized intelligent assistant 1300 to call one or more other conference participants to immediately join the first conference participant in conference environment 100. In some examples, computerized intelligent assistant 1300 may include a scheduling machine configured to determine a time and location of the conference.

In some examples, a location of a conference may be based on a description of a physical location (e.g., a room in a building, a global positioning system (GPS) coordinate, and/or a street address). In some examples, the location of the conference may be pre-defined by a conference participant in association with a schedule pre-defined for the conference. Alternately or additionally, a physical location may be inferred based on sensor data of one or more of computerized intelligent assistant 1300 and/or a companion device of a conference participant. In some examples, a location of the conference may be inferred based on a location of computerized intelligent assistant 1300 and/or a companion device of a conference participant (e.g., based on correlating a pre-defined map of rooms in a building with a configuration of computerized intelligent assistant 1300 with network 1310, such as an Internet protocol (IP) or media access control (MAC) address associated with a wired and/or wireless connection coupling computerized intelligent assistant 1300 to network 1310).
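The correlation between a pre-defined room map and a network configuration could be as simple as a table lookup keyed by MAC address, as in the sketch below. The table contents, the normalization rule, and the function names are assumptions for illustration; an IP-based lookup would follow the same pattern.

```python
from typing import Dict, Optional


def normalize_mac(mac: str) -> str:
    """Lower-case a MAC address and use ':' as the separator."""
    return mac.strip().lower().replace("-", ":")


def infer_room(device_mac: str, room_by_mac: Dict[str, str]) -> Optional[str]:
    """Look up the room associated with a device's MAC address, if any."""
    table = {normalize_mac(mac): room for mac, room in room_by_mac.items()}
    return table.get(normalize_mac(device_mac))
```

Returning `None` for an unknown address leaves the caller free to fall back on another source, such as a location pre-defined with the conference schedule.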

At 203, method 200 includes determining participant identities for expected conference participants. For example, such determination may be made based on the conference schedule, e.g., when determining the conference schedule is based on invitations sent to the conference participants, such invitations indicate the expected (e.g., invited) conference participants. In some examples, expected participants may include all members of an organization and/or a subset (e.g., department or team) of the organization. In some examples, expected participants may be inferred based on past participation, e.g., based on a frequency of attending a regularly scheduled meeting.
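Inference from past participation can be sketched as a frequency count over earlier occurrences of a recurring meeting. The attendance-list input and the 50% cutoff below are assumed for illustration; the disclosure does not specify a threshold.

```python
from collections import Counter
from typing import List, Set


def infer_expected_participants(
    past_attendance: List[List[str]],
    min_fraction: float = 0.5,
) -> Set[str]:
    """Return participants who attended at least `min_fraction` of past meetings.

    `past_attendance` holds one list of attendee names per past meeting.
    """
    if not past_attendance:
        return set()
    counts = Counter(
        name for meeting in past_attendance for name in set(meeting)
    )
    total = len(past_attendance)
    return {name for name, count in counts.items() if count / total >= min_fraction}
```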

At 204, determining participant identities includes determining a pre-registered signature for each participant, where such pre-registered signature may be useable to computer-recognize an identity of the participant (e.g., based on audiovisual data captured by computerized intelligent assistant 1300). For example, such signature for a conference participant may include a computer-readable representation of one or more exemplary audiovisual data, e.g., face photos, voice audio samples, and/or biometric data (e.g., fingerprint data). In some examples, the computer-readable representation may include the one or more exemplary audiovisual data directly (e.g., a face photo). In some examples, the computer-readable representation may include one or more identified features associated with the exemplary audiovisual data (e.g., visual markers indicating a shape and/or position of a facial feature). In some examples, a pre-registered signature for a conference participant may include an associated companion device (e.g., a MAC address of a mobile phone). In some examples, a signature for a conference participant may include an associated user account (e.g., an account in a meeting program running on a mobile phone communicatively coupled to backend server 1320 and/or computerized intelligent assistant 1300). In some examples, a pre-registered signature may be available for only a subset of conference participants, or a pre-registered signature may not be available for any conference participants. In some examples, the computerized intelligent assistant 1300 may include an identity machine configured to determine participant identities for a plurality of conference participants including a set of remote participants and a set of local participants.
Determining a participant identity for a participant of the plurality of conference participants may include recognizing a pre-registered signature for the participant, wherein the pre-registered signature is useable to computer-recognize an identity of the participant. In some examples, one or more local and/or remote conference participants may be recognized by a face identification machine based on digital video received from a local and/or remote computing device (e.g., a companion device of a conference participant), for example, by operating the face identification machine to recognize one or more faces of one or more remote conference participants featured in a digital video captured by a remote companion device of the remote conference participant.

In some examples, a pre-registered signature for a conference participant may be retrieved from a secure personal data storage system (e.g., running on backend server 1320), wherein access to signature data for a conference participant is constrained based on a user credential and/or an enterprise credential (e.g., prohibiting access to the signature data for a conference participant by users other than the conference participant, and/or preventing access to the signature data by users outside of an organization to which the conference participant belongs). In some examples, signature data is only accessed by the secure personal data storage system and/or backend server 1320 for the purpose of identifying users in cooperation with computerized intelligent assistant 1300, and the signature data is not observable or otherwise accessible by conference participants. In some examples, in addition to being stored in a secure personal data storage system and/or backend server 1320, signature data is stored in one or more other locations (e.g., in the form of private signature data on a companion device of a user, enterprise signature data on an enterprise server, or in any other suitable location). The above-described approaches to handling (e.g., storing, securing, and/or accessing) signature data are non-limiting, exemplary approaches to handling sensitive data (e.g., private, confidential and/or personal data). A computerized intelligent assistant according to the present disclosure may utilize these exemplary approaches, and/or any other suitable combination of state-of-the-art and/or future methods for handling sensitive data.

The methods herein, which involve the observation of people, may and should be enacted with utmost respect for personal privacy. Accordingly, the methods presented herein are fully compatible with opt-in participation of the persons being observed. In embodiments where personal data (e.g., signature data, raw audiovisual data featuring a person, such as video data captured by a camera of computerized intelligent assistant 1300, and/or processed audiovisual data) is collected on a local system and transmitted to a remote system for processing, the personal data can be transmitted in a secure fashion (e.g., using suitable data encryption techniques). Optionally, the personal data can be anonymized. In other embodiments, personal data may be confined to a local system, and only non-personal, summary data transmitted to a remote system. In other embodiments, a multi-tier privacy policy may be enforced, in which different types of data have different levels of access and/or obfuscation/anonymization (e.g., enterprise biometric signature useable by all enterprise security systems to verify identity, but personal profile data only accessible by authorized users).

At 205, determining participant identities further includes recognizing pre-registered content of interest for a participant. “Content of interest” may be used herein to refer to any topic or subject which may be of interest to a conference participant. Non-limiting examples of content of interest include any of: 1) a word and/or phrase, 2) a task (e.g., an intended task, or a commitment made by one or more conference participants), 3) an identity of another conference participant (e.g., a name or email address), 4) a digital file (e.g., a particular document), 5) analog multimedia and/or audiovisual content (e.g., a particular photo or diagram, such as a diagram shared on a whiteboard), and/or 6) a date, time, and/or location. In some examples, content of interest for a conference participant may be pre-defined by the conference participant, by any other conference participant, or by another user in an organization associated with the conference participant (e.g., the conference participant's supervisor). In some examples, content of interest for a conference participant may be inferred based on previous interaction of the conference participant with computerized intelligent assistant 1300 and/or with computer services communicatively coupled to computerized intelligent assistant 1300 (e.g., another, different computerized intelligent assistant, an email program, and/or a note-taking program). In some examples, content of interest for a conference participant may be inferred based on a personal preference of the conference participant (e.g., wherein such personal preference is established through previous interactions with one or more computer services communicatively coupled to computerized intelligent assistant 1300).
In some examples, content of interest for a conference participant may be inferred based on a current context of the conference participant, wherein such current context may be recognized based on previous interactions with one or more computer services communicatively coupled to computerized intelligent assistant 1300. In some examples, content of interest for a conference participant may be inferred based on a job title and/or role of the conference participant. In some examples, content of interest for a conference participant may be based on previous conferences including the conference participant, e.g., based on topics that arose in such previous conferences wherein the conference participant indicated potential interest in the topics by attending conferences at which the topics were mentioned and/or by participating in conversations in which the topics were mentioned.

At 211, method 200 further includes automatically creating a transcript of the conference. The transcript may record and/or otherwise track any suitable details of the conference. Non-limiting examples of details to be included in a transcript include: 1) participant arrivals and departures, 2) conference audio/video, 3) transcribed conversations by local and/or remote participants, 4) visual information shared by conference participants (e.g., diagrams, drawings, photographs), 5) digital information shared by conference participants (e.g., document files, multimedia files, web addresses, email addresses, or any other digital content) and interaction with the shared digital information by conference participants (e.g., clicking on a next slide in a presentation), 6) gestures and/or non-verbal cues performed by the participants (e.g., hand gestures, laughing, and/or clapping), and/or 7) tag information submitted via companion devices of conference participants (e.g., indicating a bookmark or point of interest in the conference, or more generally any event occurring at a particular time). Any details included in the transcript may be correlated with a timestamp. Accordingly, the transcript may interleave details of the conference in a temporal order in which such details occurred in the conference. Whenever a detail is recorded, computerized intelligent assistant 1300 may provide a notification to one or more conference participants in real-time (e.g., a notification message sent to a companion device describing the recorded detail).
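Interleaving timestamped details from several sources can be sketched as a merge of per-source event streams. The `TranscriptEvent` class and its fields are hypothetical, and the sketch assumes each input stream is already sorted by timestamp, which is what `heapq.merge` requires.

```python
import heapq
from dataclasses import dataclass, field
from typing import Iterable, List


@dataclass(order=True)
class TranscriptEvent:
    timestamp_s: float                  # only field used for ordering
    kind: str = field(compare=False)    # e.g., "arrival", "speech", "gesture"
    detail: str = field(compare=False)  # free-form description of the event


def interleave(*streams: Iterable[TranscriptEvent]) -> List[TranscriptEvent]:
    """Merge already-sorted event streams into one temporally ordered list."""
    return list(heapq.merge(*streams))
```

A usage example with two streams:

```python
arrivals = [TranscriptEvent(0.0, "arrival", "Anna"), TranscriptEvent(12.0, "arrival", "Dan")]
speech = [TranscriptEvent(5.0, "speech", "Anna: hello")]
for event in interleave(arrivals, speech):
    print(f"{event.timestamp_s:6.1f}s {event.kind}: {event.detail}")
```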

In some examples, computerized intelligent assistant 1300 may include a transcription machine configured to automatically create a transcript for the conference based on audiovisual data including video data captured by the camera and audio data captured by the microphone. Accordingly, the transcription machine may create a transcript including arrivals and departures of conference participants recognized in the audiovisual data, based on participant identities previously determined for the conference participants, and based on recognizing the participants based on the previously determined identities. In some examples, recognizing the participants may be performed by a face identification machine based on the previously determined identities (e.g., by recognizing a conference participant face based on similarity to a photograph of the conference participant face included in the previously determined identity for the conference participant). For example, the transcript may include an arrival time indicating a time of arrival of a conference participant, and/or a departure time indicating a time of departure of the conference participant. In some examples, the arrival time may be determined based on a time of recognition of a conference participant by the face identification machine.

The transcript created by the transcription machine may further include transcribed participant conversations for local and remote participants including transcribed speech audio of local participants captured by the microphone, and multimedia information shared at the conference, wherein the multimedia information shared at the conference includes analog visual content shared at a board, and wherein the transcript includes a timestamp indicating a time at which new visual content was added to the board and a graphical depiction of the new visual content. In a non-limiting example, the transcription machine may incorporate a diarization machine and/or a diarization framework configured for transcription (e.g., diarization framework 600, diarization framework 1100, diarization framework 1200 and/or diarization machine 132).

In some examples, creating the transcript may be based on operating a speech recognition machine to translate a computer-readable audio signal featuring speech audio of a conference participant into a text representing utterances contained in the speech audio. In some examples, creating the transcript may include operating an attribution machine to attribute speech audio and/or text to a conference participant. For example, the attribution machine may be configured to recognize a speaker in speech audio and attribute the speech audio to the conference participant, so that after the speech audio is translated into text by the speech recognition machine, the text may be attributed to the speaker. Alternately or additionally, the attribution machine may be configured to recognize a speaker based on text after translation by the speech recognition machine (e.g., based on word choice, speaking style, and/or any other suitable natural language features of the text). In a non-limiting example, the attribution machine may be configured to attribute a portion of transcript text to each conference participant of a plurality of conference participants. In some examples, the attribution machine may incorporate a diarization machine and/or a diarization framework configured for transcription (e.g., diarization framework 600, diarization framework 1100, diarization framework 1200 and/or diarization machine 132). Alternately or additionally, any suitable technique(s) for attributing speech audio and/or text to one or more speakers may be used to implement the attribution machine.

Furthermore, conference participants may be able to access the full transcript recorded so far in real time, during the conference, e.g., to review details that were previously recorded. In some examples, computerized intelligent assistant 1300 may provide a notification indicating whether or not it is currently recording a transcript (e.g., a notification message sent to a companion device, and/or a flashing green light during recording). In some examples, conference audio and/or video may be retained for a final transcript. In other examples, conference audio and/or video may be analyzed in order to recognize other details of the conference, and subsequently discarded. In some examples, conference audio and/or video may be only temporarily retained (e.g., to facilitate review of other details collected in a transcript) and subsequently discarded (e.g., at a predefined future date, or as directed by a conference participant). In some examples, backend server 1320 may be configured to maintain a running transcript of the conference including text attributed to each conference participant and/or other events of interest during the conference. Accordingly, backend server 1320 may be further configured to provide the running transcript of the conference to companion devices of conference participants, e.g., by sending the whole transcript, or by sending one or more “delta” data, each delta datum indicating a recent addition and/or change to the transcript.
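
In a non-limiting illustrative sketch, the delta-based update described above may be modeled as a revision log that a companion device replays. The names used below (TranscriptEntry, RunningTranscript, upsert, deltas_since, apply_deltas) are hypothetical and are introduced only for illustration; the disclosure does not prescribe any particular data format or protocol for the delta data.

```python
from dataclasses import dataclass, field


@dataclass
class TranscriptEntry:
    entry_id: int
    speaker: str
    text: str


@dataclass
class RunningTranscript:
    """Hypothetical server-side running transcript with a revision log."""
    entries: dict = field(default_factory=dict)  # entry_id -> TranscriptEntry
    log: list = field(default_factory=list)      # one delta datum per change

    def upsert(self, entry: TranscriptEntry) -> None:
        # Adding a new entry and correcting an existing one both yield a delta.
        self.entries[entry.entry_id] = entry
        self.log.append(entry)

    def deltas_since(self, revision: int) -> list:
        # A companion device reports how many deltas it has already applied.
        return self.log[revision:]


def apply_deltas(local: dict, deltas: list) -> dict:
    # Client side: replay deltas in order; later deltas overwrite earlier ones.
    for entry in deltas:
        local[entry.entry_id] = entry
    return local
```

In such a sketch, a companion device that joins late may request all deltas from revision zero (equivalent to receiving the whole transcript), whereas a device that is already connected may request only the deltas it has not yet applied.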

Creating the transcript of the conference at 211 may include tracking participant arrivals at 212. Computerized intelligent assistant 1300 may track local participant arrivals and remote participant arrivals. “Arrival” may be used herein, with regard to remote participants, to refer to a time at which the remote participant is available for participation in the conference (e.g., when the remote participant remotely joins the conference via telephone, audio conference, video conference, or otherwise).

Computerized intelligent assistant 1300 may be configured to track the arrival of a local participant by recognizing an identity of the local participant. For example, FIG. 13 shows computerized intelligent assistant 1300 recognizing conference participant 161 as “Anna.” Computerized intelligent assistant 1300 may recognize a local participant as they enter conference environment 100 based on audiovisual data captured by a camera and/or microphone of computerized intelligent assistant 1300 and/or a peripheral/cooperating device. Computerized intelligent assistant 1300 may recognize the local participant based on a pre-registered signature of the local participant, when such pre-registered signature is available.

In some examples, computerized intelligent assistant 1300 may constrain recognition of an arriving local participant to only recognize local participants who are expected to arrive based on the previously determined participant identities of expected participants. Alternately or additionally, computerized intelligent assistant 1300 may constrain recognition of an arriving local participant to only recognize any suitable set of potential conference participants and/or individuals who may be likely to enter conference environment 100. For example, such potential conference participants and/or individuals may include other individuals within an organization associated with the conference, and/or other individuals with offices resident in a building housing conference environment 100. In some examples, computerized intelligent assistant 1300 may recognize one or more different sets of individuals for recognition, e.g., 1) invited conference participants, 2) colleagues of the invited conference participants from the same organization who are likely to drop in to the conference, and/or 3) other individuals having offices in the building housing conference environment 100. In some examples, computerized intelligent assistant 1300 may be configured to prioritize using one or more of the different sets of individuals to attempt recognition, e.g., computerized intelligent assistant 1300 may be configured to first attempt to recognize an individual as one of the invited conference participants and to subsequently attempt to recognize the individual as a colleague from the same organization only if the individual was not recognized from among the invited conference participants. Such prioritization of a set of individuals for attempted recognition may improve a speed and/or computational efficiency of recognizing an individual when the individual is in the prioritized set.
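
In a non-limiting illustrative sketch, the prioritized recognition described above may be expressed as a search over ordered candidate sets that stops at the first match. The function name recognize_prioritized and the matches callback (standing in for any suitable face and/or voice comparison against a pre-registered signature) are hypothetical and are assumed here only for illustration.

```python
from typing import Callable, Optional, Sequence


def recognize_prioritized(
    observation: object,
    candidate_sets: Sequence[Sequence[str]],
    matches: Callable[[object, str], bool],
) -> Optional[str]:
    """Try each candidate set in priority order; stop at the first match."""
    tried = set()
    for candidates in candidate_sets:
        for identity in candidates:
            if identity in tried:
                continue  # already ruled out in a higher-priority set
            tried.add(identity)
            if matches(observation, identity):
                return identity
    return None  # not recognized in any candidate set
```

For example, candidate_sets may list invited conference participants first, colleagues from the same organization second, and other building occupants third, so that the comparatively expensive comparison is performed against lower-priority sets only when the higher-priority sets yield no match.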

In some examples, a local participant may not be immediately recognizable due to an insufficiency of data being used to identify the local participant (e.g., if the local participant's face is occluded, it may not be feasible to identify the local participant based on a pre-registered signature comprising a face visual appearance; similarly, if the local participant's companion device is turned off, information associated with the companion device may not be available for use in identification). In some examples, pre-registered participant identity data for a conference participant may not be available or may be incorrect or otherwise insufficient for identifying the conference participant. Accordingly, computerized intelligent assistant 1300 may assign the local participant a guest identity in order to defer identification of the local participant until more data is available. In an example, a first local participant is not immediately identifiable based on a face appearance (e.g., because no pre-registered face appearance data is available for the participant), but upon entering conference environment 100, a second local participant may greet the first local participant by name. Accordingly, computerized intelligent assistant 1300 may recognize the name of the first local participant based on audio data collected at a microphone of computerized intelligent assistant 1300, and identify the first local participant based on the name (e.g., by correlating the name with a name of an invited conference participant).

In an example, a conference participant may not be initially recognized upon entering conference environment 100, and computerized intelligent assistant 1300 may prompt the local participant to expressly provide further identifying information, e.g., by asking the local participant to provide a name. In some examples, when a local participant is recognized, computerized intelligent assistant 1300 may further prompt the local participant to register an identity to facilitate future recognition. Such prompting may include any suitable notification, e.g., a prompt at a companion device, and/or a question posed by computerized intelligent assistant 1300 via speech audio. Identifying information of a conference participant may include personal and/or sensitive data such as photos, voice audio samples, etc. Accordingly, prompting the local participant can include expressly informing the local participant as to specifically what identifying information is being stored for future use, and/or specifically how such identifying information may be used. In some examples, an image or audio clip of a guest user captured by computerized intelligent assistant 1300 may be associated with a “guest” identity and shared with other conference participants, who may positively identify the guest based on the image or audio clip. After sufficient information is available to identify a local participant, any guest identity associated with the local participant may be replaced and/or updated with a recognized identity.
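
In a non-limiting illustrative sketch, deferring identification via a guest identity and later replacing the guest identity may be modeled as follows. The class ParticipantRoster, its methods, and the integer track identifiers are hypothetical; the sketch assumes that a spoken name has already been extracted from conference audio by any suitable speech recognition technique.

```python
import itertools
from typing import Optional


class ParticipantRoster:
    """Hypothetical roster mapping tracked individuals to display identities."""

    def __init__(self, invited: list):
        self.invited = {name.lower(): name for name in invited}
        self.identities = {}  # track_id -> display identity
        self._guest_numbers = itertools.count(1)

    def track(self, track_id: int, recognized_name: Optional[str] = None) -> str:
        # Assign a placeholder guest identity when recognition is not yet possible.
        identity = recognized_name or f"Guest {next(self._guest_numbers)}"
        self.identities[track_id] = identity
        return identity

    def resolve_by_name(self, track_id: int, heard_name: str) -> bool:
        # Replace the guest identity only if the name heard in conversation
        # correlates with the name of an invited conference participant.
        name = self.invited.get(heard_name.lower())
        if name is None:
            return False
        self.identities[track_id] = name
        return True
```

In this sketch, a name that does not correlate with any invited participant leaves the guest identity in place, so that identification remains deferred until more data is available.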

In an example, computerized intelligent assistant 1300 may be configured to provide a notification to the conference leader indicating that a conference participant has been detected and is being tracked, while such conference participant has not yet registered a signature. The notification provided to the conference leader may include a sample of audio/video data associated with the conference participant that has been captured by computerized intelligent assistant 1300. Accordingly, the conference leader may respond to the provided notification by indicating an identity of the conference participant (e.g., by selecting a name, user account, and/or email address associated with the conference participant). Responsive to such selection, computerized intelligent assistant 1300 may automatically generate a signature (e.g., based on face image and/or voice audio) and register the signature for the conference participant. In some examples, computerized intelligent assistant 1300 is configured to automatically generate a signature only after first proposing to the user to do so and receiving affirmative permission from the user (e.g., by outputting speech audio asking for a natural language response indicating permission to automatically generate the signature). Alternately or additionally, computerized intelligent assistant 1300 may use the indicated identity for a display name for the conference participant (e.g., even when no signature has been generated for the conference participant). Computerized intelligent assistant 1300 may additionally use any available images of the conference participant to present a speaker image (e.g., in the transcript alongside the speaker's name associated with events in the transcript associated with the speaker). For example, a speaker image may be based on an image and/or raw video captured by a camera of computerized intelligent assistant 1300 (e.g., even when computerized intelligent assistant 1300 is unable to identify the conference participant based on such image), based on an image provided by another conference participant (e.g., based on an identifying image provided by the conference leader), and/or based on a previously saved image of the conference participant (e.g., a profile image).

In some examples, computerized intelligent assistant 1300 may be configured to continually or periodically collect new voice audio and face images to improve signature quality for one or more conference participants. For example, new voice audio and face images may be collected for a conference participant based on recognizing the conference participant according to a pre-registered signature and recording voice audio and face images associated with the recognized conference participant in order to incorporate such voice audio and face images in an updated signature. In some examples, a signature may be based on voice audio sufficient for robustly identifying a conference participant, while such signature is insufficient for robustly identifying the conference participant based on face images. Accordingly, when the conference participant is recognized based on voice audio, computerized intelligent assistant 1300 may record additional face images, so as to automatically improve the signature for the conference participant based on the face images. Similarly, in some examples, a signature may be based on face images sufficient for robustly identifying a conference participant, while such signature is insufficient for robustly identifying the conference participant based on voice audio; accordingly, when the conference participant is recognized based on face images, computerized intelligent assistant 1300 may record additional voice audio, so as to automatically improve the signature for the conference participant based on the voice audio. In this manner, signature quality may be improved and signatures may be kept up-to-date (e.g., with regard to potential changes in face appearance and/or voice audio of conference participants), while reducing an enrollment effort for a conference participant to register a signature.

In some examples, computerized intelligent assistant 1300 may be configured to request permission from a conference participant to retain and subsequently use signature data, e.g., when computerized intelligent assistant 1300 automatically generates a signature for a previously unrecognized conference participant and/or when computerized intelligent assistant 1300 automatically improves a signature for a previously registered conference participant. In some examples, computerized intelligent assistant 1300 may be configured to request such permission during and/or after the conference. In some examples, computerized intelligent assistant 1300 may be configured to allow a conference participant to revoke permission and/or ask a conference participant to provide updated permission at any suitable interval, e.g., according to a schedule. Accordingly, signatures for identifying conference participants may be kept up-to-date, while also allowing a conference participant to control storage and usage of a signature identifying the conference participant.

Returning to FIG. 13, the conference has not yet begun when a first conference participant 161 (Anna) arrives in the conference room. Accordingly, computerized intelligent assistant 1300 may determine that conference participant 161 is Anna. Although other conference participants are invited to the conference and expected to arrive shortly, such other conference participants have not yet arrived. Accordingly, computerized intelligent assistant 1300 may notify local participant 161 (Anna) of the situation.

Returning briefly to FIG. 14, at 231, computerized intelligent assistant 1300 may provide a notification message to conference participants. Using the above example, such notification may be provided by communicatively coupling to a local user device 171 (in the form of a mobile phone) of local participant 161 (Anna) in order to display a message at local user device 171 indicating that three other participants (Beatrice, Carol, and Dan) are expected to arrive but have not yet arrived. More generally, “notification” as used herein refers to any suitable notification and/or acknowledgement signal. “Notification message” is used herein to refer to any suitable means of notification, e.g., an electronic message sent via any suitable protocol (e.g., Short Message Service (SMS), email, or a chat protocol), an audio message output at a companion device, and/or an audio message output at computerized intelligent assistant 1300 or a different computerized intelligent assistant.

Returning briefly to FIG. 14, at 241, computerized intelligent assistant 1300 may alternately or additionally provide an acknowledgement signal. Using the above example, an acknowledgement signal may be used to inform local participant 161 (Anna) of the status of other conference participants, e.g., by outputting speech audio informing Anna that three other participants (Beatrice, Carol, and Dan) are expected to arrive but have not yet arrived, as shown in the speech bubble in FIG. 13. Alternately or additionally, computerized intelligent assistant 1300 may provide any other suitable notification, e.g., by showing a red light indicating that the meeting has not yet begun. More generally, an acknowledgement signal may include any suitable combination of audio output (e.g., speech) or visible signals (e.g., a color and/or flashing pattern of a light included in computerized intelligent assistant 1300, visual content output at a display included in computerized intelligent assistant 1300, and/or visual content output at the display of another device (e.g., local user device 171)).

The conference also includes one or more remote participants, e.g., remote participant 162 (Roger). Remote participants may be in any remote location, e.g., collaborating from home or collaborating during transit. In some examples, remote participants may be relatively near conference environment 100, e.g., in an office in a building housing conference environment 100 or even a local participant that is joining the conference via a network connection. Accordingly, computerized intelligent assistant 1300 may communicatively couple via network 1310 to a remote user device 172 of remote participant 162 (e.g., Roger's tablet device).

Computerized intelligent assistant 1300 may be configured to track the arrival of a remote participant based on the remote participant logging in to a remote conferencing program (e.g., a messaging application, voice and/or video chat application, or any other suitable interface for remote interaction). Alternately or additionally, computerized intelligent assistant 1300 may be configured to recognize an availability status of a remote participant (e.g., based on a status in the remote conferencing program) and to assume the remote user is present if the remote user is indicated to be available, in advance of the remote user logging in to a remote conferencing program. Accordingly, computerized intelligent assistant 1300 may provide a notification message to the remote participant inviting the remote participant to log in to the conferencing program, e.g., at a previously defined start time of the conference, when asked to do so by a local participant, or at any other suitable time. More generally, computerized intelligent assistant 1300 may be configured to recognize availability and/or attendance of a remote participant based on a status/context (e.g., power status or geographic location) of a remote user device of the remote participant. In some examples, a conference participant may authorize remote user devices to intelligently assess a remote user's availability based on one or more context signals (e.g., Roger is not available when on another phone call or when talking to children, but is available when working on a word processing document).

An expanded view 180 of a display of remote user device 172 is shown in FIG. 13. As shown in the expanded view 180, computerized intelligent assistant 1300 may provide information regarding the conference to remote user device 172 for display. For example, expanded view 180 further depicts a graphical user interface (GUI) for remote participation in the conference. The GUI includes transcript entries 181, as well as a transcript timeline scrollbar 182 and a chat entry box 183. Via the communicative coupling to network 1310 and computerized intelligent assistant 1300, remote user device 172 receives events for display among transcript entries 181. Accordingly, transcript entries 181 show basic details of the conference so far, namely indicating who is invited to the conference, who is already in attendance, and a description of the process of receiving an invitation to connect to the conference, connecting, and retrieving a transcript so far. In particular, transcript entries 181 include a header indicating that the conference is regarding a “Sales and planning meeting” and indicating that remote participant 162 (Roger) will be remotely participating. Transcript entries 181 further include an indication of the expected (local and remote) conference participants (listing Anna, Beatrice, Carol, Dan, and Roger). Transcript entries 181 further include an indication of which local participants have already arrived in the conference room, namely local participant 161 (Anna). Therefore, even though Anna may not have announced her presence, Roger knows she is present.

Although only a small number of transcript entries 181 are shown so far, scrollbar 182 may be used to navigate through a timeline of the conference in order to view past and/or present details of the conference in the transcript. In the remainder of the present disclosure, expanded view 180 will be updated to show a small number of recent transcript entries 181. In subsequent figures, transcript entries 181 will be replaced with more recent entries as though scrolling through a transcript; accordingly, remote participant 162 (Roger) may use scrollbar 182 to navigate to previous entries (e.g., to display the entries shown in FIG. 15, after such entries have been replaced with more recent entries).

In some examples, a conference may have a previously designated presenter and/or organizer, referred to herein as a conference leader. For example, the conference leader of the “Sales and planning meeting” is Carol, who is not yet in attendance. Computerized intelligent assistant 1300 may be configured to take note when the conference leader is present. Accordingly, as shown in expanded view 180, remote participant 162 (Roger)'s display device may receive and display an indication that the conference participants are waiting for Carol, and this indication may be updated when Carol is present. In some examples, computerized intelligent assistant 1300 may be configured to encourage waiting to begin a conference until all invited conference participants are present, until a threshold proportion (e.g., 50%) of invited conference participants are present, or until particular participants (e.g., the conference leader) are present. In general, computerized intelligent assistant 1300 may provide any suitable indication to remote user device 172 in order to keep remote participant 162 (Roger) apprised of the conference attendance and schedule.
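
In a non-limiting illustrative sketch, the criteria described above for encouraging waiting may be combined in a single check. The function name should_encourage_waiting and its parameters are hypothetical; setting min_fraction to 1.0 corresponds to waiting for all invited conference participants, and the required argument corresponds to particular participants such as the conference leader.

```python
from typing import Iterable


def should_encourage_waiting(
    invited: Iterable[str],
    present: Iterable[str],
    required: Iterable[str] = (),
    min_fraction: float = 0.5,
) -> bool:
    """Return True if the assistant should suggest waiting before beginning."""
    invited_set = set(invited)
    attending = invited_set & set(present)  # count invited participants only
    if not set(required) <= attending:
        return True  # a required participant (e.g., the leader) is missing
    if not invited_set:
        return False  # nobody was invited, so there is nobody to wait for
    return len(attending) / len(invited_set) < min_fraction
```

Using the example conference above, with Anna, Beatrice, Carol, Dan, and Roger invited and Carol designated as the conference leader, this sketch would suggest waiting while only Anna and Roger are in attendance, and would no longer suggest waiting once Beatrice and Carol have also arrived.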

Creating a transcript of the conference at 211 further includes recording conference audio and/or video at 213. Computerized intelligent assistant 1300 may be configured to begin recording audio and/or video at any suitable time. For example, computerized intelligent assistant 1300 may continuously record the conference environment 100. In another example, computerized intelligent assistant 1300 may wait to record until certain criteria are satisfied (e.g., after Carol arrives).

FIG. 15 shows conference environment 100 at a later time, after local participant 163 (Beatrice) has arrived, and as local participant 164 (Carol) is arriving. Local participant 163 (Beatrice) has set up an additional companion device, namely local user device 173 in the form of a laptop computer.

As local participants arrive at conference environment 100, computerized intelligent assistant 1300 may be configured to greet one or more of the local participants based on how many local participants are present. For example, computerized intelligent assistant 1300 may be configured to greet only the first local participant, e.g., to inform them that they are at the right location and on schedule. Accordingly, the first local participant may greet and/or converse with subsequently arriving local participants, obviating a need for computerized intelligent assistant 1300 to provide such greeting and/or conversation. Accordingly, in the conference environment 100 shown in FIG. 15, computerized intelligent assistant 1300 may not have provided local participant 163 (Beatrice) with a greeting upon arrival. In some implementations, computerized intelligent assistant 1300 may be configured to greet a newly-arriving conference participant only if an already-present conference participant does not greet the newly-arriving conference participant. In some examples, computerized intelligent assistant 1300 may be configured to greet each arriving local participant until specific criteria are satisfied (e.g., until the conference leader arrives or until a certain number of participants are present).

Computerized intelligent assistant 1300 may be configured to use a different greeting for an arriving local participant based on a role of the local participant. In an example, computerized intelligent assistant 1300 may be configured to greet a conference leader by asking whether to begin the conference. In an example, computerized intelligent assistant 1300 may be configured to greet a conference leader by asking whether to connect one or more remote participants. For example, in FIG. 15, computerized intelligent assistant 1300 asks local participant 164 (Carol) whether to connect an additional remote participant, “Robert” (in addition to the local participants and remote participant 162 (Roger) who are already connected). As depicted in FIG. 15, computerized intelligent assistant 1300 is configured to interpret local participant 164 (Carol)'s response as an indication that the conference should begin (at the same time as the additional remote participant, Robert, is being connected), in lieu of explicitly asking whether to begin the conference.

Alternately or additionally, computerized intelligent assistant 1300 may be configured to greet a conference leader by asking whether to send a notification to participants who are not yet present. For example, the conference has four local invitees, of which only three are present at the time of local participant 164 (Carol)'s arrival; accordingly, since local participant Dan is not yet present, computerized intelligent assistant 1300 could ask Carol whether to remind Dan about the conference (not shown in FIG. 15) in addition to asking whether to connect Robert (as shown in FIG. 15). Similarly, computerized intelligent assistant 1300 may be configured to ask the conference leader whether to wait for one or more other members (e.g., to wait for all members to arrive, or to wait for an additional conference leader or designated presenter to arrive).

In some examples, one or more local and/or remote participants who are not invited to the conference or who are otherwise not-yet-attending the conference may be added to the conference after the conference begins. For example, a conference participant may ask computerized intelligent assistant 1300 to invite an additional remote participant to join, e.g., to include a colleague who has been mentioned in conversation or who is an expert on a topic being mentioned in conversation. Accordingly, computerized intelligent assistant 1300 may send a notification to the remote participant (e.g., for display at a companion device of the remote participant). The notification may include details of the conference so far as recorded in the transcript. For example, if local participant 164 (Carol) asks computerized intelligent assistant 1300 to invite a colleague who is an expert on a topic being mentioned in conversation, a notification sent to the colleague may include the location of the ongoing conference, along with an indication that the colleague was invited to join the ongoing conference by Carol, along with one or more sentences, phrases, and/or summaries from the transcript in which the colleague was mentioned and/or one or more sentences, phrases, and/or summaries from the transcript in which the topic was mentioned.

In the following description and in subsequent figures (FIGS. 16-19), backend server 1320 and network 1310 are no longer shown, although computerized intelligent assistant 1300 remains communicatively coupled to companion devices (e.g., remote user device 172) via backend server 1320 and network 1310. Similarly, in the following description and in subsequent figures (FIGS. 16-19), remote participant 162 (Roger) and remote user device 172 are not shown; instead, the subsequent figures focus on expanded view 180 of remote user device 172 in order to show interactions between remote participant 162 and local participants via the GUI shown in expanded view 180.

At 251 of FIG. 14, method 200 may further include providing a reviewable transcript to conference participants. Such reviewable transcript may be provided to companion devices of conference participants to be displayed in real-time (e.g., as shown in the evolving expanded view 180 in FIGS. 13 and 15-19). Alternately or additionally, such reviewable transcript may be provided to conference participants after the conference has ended. Such reviewable transcript may include content substantially similar to the content depicted in expanded view 180 of FIGS. 13 and 15-19. As described above, and as will be shown further below, expanded view 180 shows the transcript at various different times in the conference. Accordingly, scrollbar 182 may be utilized to scroll to different times and/or recognized events (e.g., “E1”) in order to view any details collected and recorded in the transcript, e.g., any of the details collected when performing method 200 or any other details of a conference described herein. In addition to using a scrollbar 182 to scroll to different times and/or events in the conference, the reviewable transcript may be indexed, searched, and/or filtered by any suitable details, e.g., speaker name, time range, words in transcribed conversation, or any other details recorded in the transcript as described herein. In some examples, the reviewable transcript may additionally include reviewable raw audio and/or video captured during the conference. In some examples, the reviewable audio and/or video may be associated with conference events, e.g., to allow navigating in the audio and/or video based on the transcript. In some examples, conference events may be associated with frames and/or short video clips from the reviewable video, and the frames/clips may be shown in the transcript along with conference events, e.g., allowing the frames/clips to be used for navigating in the transcript. Such clips and/or other aspects of the transcript may be used to access the complete audio and/or video recording from desired locations in the recording. In some examples, the reviewable transcript includes one or more difference images, showing changes to visual information shared during the conference along with indications of corresponding times at which the visual information was changed. In some examples, the reviewable transcript is configured to allow navigation based on images of shared visual information (e.g., difference images). For example, responsive to selection of the difference image, the reviewable transcript may be configured to navigate to a portion of the transcript corresponding to a time at which the visual information was changed as shown in the difference image.
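
In a non-limiting illustrative sketch, filtering the reviewable transcript by speaker name, time range, and words in transcribed conversation may be implemented as a conjunction of optional criteria. The Event record and the filter_events function are hypothetical names assumed for illustration, and the simple case-insensitive substring match shown here stands in for any suitable search technique.

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional


@dataclass(frozen=True)
class Event:
    time_s: float  # seconds since the start of the conference
    speaker: str
    text: str


def filter_events(
    events: Iterable[Event],
    speaker: Optional[str] = None,
    start_s: Optional[float] = None,
    end_s: Optional[float] = None,
    words: Iterable[str] = (),
) -> List[Event]:
    """Keep events matching every supplied criterion; omitted criteria match all."""
    wanted = [w.lower() for w in words]
    kept = []
    for event in events:
        if speaker is not None and event.speaker != speaker:
            continue
        if start_s is not None and event.time_s < start_s:
            continue
        if end_s is not None and event.time_s > end_s:
            continue
        # Case-insensitive substring match; every requested word must appear.
        if any(w not in event.text.lower() for w in wanted):
            continue
        kept.append(event)
    return kept
```

Because each event in this sketch carries a timestamp, the same timestamps may also serve as navigation targets, e.g., for seeking within associated audio and/or video.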

Information displayed in the reviewable transcript may be tailored to a particular conference participant, by filtering or re-formatting events/details in the transcript. For example, although FIG. 15 has a transcript entry stating that Anna, Beatrice, and Carol are in attendance as depicted in expanded view 180 of remote user device 172 (e.g., Roger's mobile phone), a different companion device belonging to Anna may instead have a similar entry stating that Beatrice, Carol, and Roger are in attendance (e.g., omitting the indication that the conference participant who is viewing the companion device is in attendance). In some examples, information displayed in the reviewable transcript may be summarized and/or contextualized to draw attention to events of potential interest to a conference participant.
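
In a non-limiting illustrative sketch, the viewer-specific attendance entry described above may be produced by omitting the viewing participant from the list of attendees before formatting. The function name attendance_entry and the exact wording of the generated entry are assumptions made only for illustration.

```python
from typing import Sequence


def attendance_entry(attendees: Sequence[str], viewer: str) -> str:
    """Format an attendance entry that omits the participant viewing it."""
    others = [name for name in attendees if name != viewer]
    if not others:
        return "No other participants are in attendance."
    if len(others) == 1:
        return f"{others[0]} is in attendance."
    if len(others) == 2:
        return f"{others[0]} and {others[1]} are in attendance."
    return ", ".join(others[:-1]) + f", and {others[-1]} are in attendance."
```

With Anna, Beatrice, Carol, and Roger in attendance, such a sketch yields an entry listing Anna, Beatrice, and Carol for Roger's device, and an entry listing Beatrice, Carol, and Roger for Anna's device.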

In an example, when a conference participant arrives at the conference late or leaves the conference early, the reviewable transcript may focus on portions of the conference during which the conference participant was absent. Similarly, the reviewable transcript may focus on specific times in the transcript when the conference participant's name, or content of interest to the participant, was mentioned. For example, if a conference participant leaves early, the reviewable transcript may focus on a time at which the conference participant's name was mentioned, along with a previous and following sentence, phrase, or summary to provide context. In some examples, the reviewable transcript may be provided to all conference participants, even conference participants who were invited but never showed up, conference participants who were merely mentioned in the conference, and/or conference participants having content of interest that was mentioned in the conference (even when such participants were never invited).
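
In a non-limiting illustrative sketch, focusing the reviewable transcript on name mentions together with a previous and following sentence may be implemented as a windowed selection over transcript sentences. The function name mention_context and the window parameter are hypothetical, and the sketch assumes that the transcript has already been segmented into sentences.

```python
from typing import List, Sequence


def mention_context(sentences: Sequence[str], name: str, window: int = 1) -> List[str]:
    """Return each sentence mentioning name, plus `window` neighbors on each side."""
    lowered = name.lower()
    keep = set()
    for i, sentence in enumerate(sentences):
        if lowered in sentence.lower():
            first = max(0, i - window)
            last = min(len(sentences), i + window + 1)
            keep.update(range(first, last))
    # Preserve transcript order and avoid duplicating overlapping context.
    return [sentences[i] for i in sorted(keep)]
```

A summary of the surrounding conversation may be substituted for the neighboring sentences where a shorter context is preferred.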

More generally, the reviewable transcript may be analyzed using any suitable machine learning (ML) and/or artificial intelligence (AI) techniques, wherein such analysis may include, for raw audio observed during a conference, recognizing text corresponding to the raw audio, and recognizing one or more salient features of the text and/or raw audio. Non-limiting examples of salient features that may be recognized by ML and/or AI techniques include 1) an intent (e.g., an intended task of a conference participant), 2) a context (e.g., a task currently being performed by a conference participant), 3) a topic, and/or 4) an action item or commitment (e.g., a task that a conference participant promises to perform). More generally, ML and/or AI techniques may be used to recognize any content of interest based on raw audio, raw video, and/or corresponding text. In some examples, ML and/or AI systems may be trained based on user feedback regarding salient features of raw audio and/or corresponding text. For example, when conference participants use tags submitted via companion devices and/or gestures to flag events of interest during a conference, the flagged events may be used, in association with raw audio occurring at the time the events were flagged, as training data for supervised training of ML and/or AI systems to recognize events which conference participants are likely to flag in future conferences. Training of ML and/or AI systems to recognize salient features may be conducted for a limited set of users (e.g., for an organization or for a team within an organization) or for a larger population of users. Analyzing the reviewable transcript or any other aspect of the conference may be performed using any suitable combination of state-of-the-art and/or future ML, AI and/or natural language processing (NLP) techniques, e.g., ML, AI and/or NLP techniques described above.

In some examples, the reviewable transcript may be provided to other individuals instead of or in addition to providing the reviewable transcript to conference participants. In an example, a reviewable transcript may be provided to a supervisor, colleague, or employee of a conference participant. In an example, the conference leader or any other suitable member of an organization associated with the conference may restrict sharing of the reviewable transcript (e.g., so that the conference leader's permission is needed for sharing, or so that the reviewable transcript can only be shared within the organization, in accordance with security and/or privacy policies of the organization). The reviewable transcript may be shared in an unabridged and/or edited form, e.g., the conference leader may initially review the reviewable transcript in order to redact sensitive information, before sharing the redacted transcript with any suitable individuals. The reviewable transcript may be filtered to focus on content of interest (e.g., name mentions and action items) for any individual receiving the reviewable transcript.

One or more conference participants (e.g., a conference leader or a designated reviewer) may review the reviewable transcript in order to edit the transcript, e.g., to correct incorrectly transcribed conversation based on recorded conversation audio, to remove and/or redact transcript entries, and/or to provide identification for conference participants who were not identified or who were incorrectly identified. Such corrective review may be done in real time as the conference transcript is gathered, and/or after the conference has ended.

After the conference, the reviewable transcript may be sent to each conference participant and/or saved to the computerized intelligent assistant 1300 and/or backend server 1320 for archival and subsequent use. The reviewable transcript may be saved in association with one or more computer services, e.g., an email application, a calendar application, a note-taking application, and/or a team collaboration application.

FIG. 16 depicts conference environment 100 at a later time relative to FIG. 15. As depicted in FIG. 16, expanded view 180 shows that the transcript entries 181 are updated to include further details of the conference so far. Furthermore, scrollbar 182 is updated to include events “E1” and “E2,” correlated with particular events among transcript entries 181. For example, event “E1” indicates the start of the conference, when local participant 164 (Carol, the conference leader) arrived. In some examples, the GUI depicted in expanded view 180 may allow remote participant 162 (Roger) to navigate to specific times in the transcript by selecting an event shown on the scrollbar 182. Although events are depicted herein with generic labels (e.g., “E1” and “E2”), a GUI for conference participation and/or transcript review may instead use descriptive labels (e.g., words and/or symbols) to indicate a specific type of event, e.g., a name mention, an action item, a shared file, or any other suitable event as described herein.

As shown in the transcript, after local participant 164 (Carol, the conference leader) arrived, local participant 164 (Carol) stated that she would set up the board while waiting for another local participant 165 (Dan). Returning briefly to FIG. 14, creating the transcript at 211 further includes transcribing local participant conversation at 214. Such transcribing of local participant conversation may include correlating speech audio of the local participant conversation with textual transcription of the speech audio, e.g., using natural language user interfaces and/or natural language processing machines of computerized intelligent assistant 1300 and/or backend server 1320. Such transcribing of local participant conversation may further include correlating speech audio of the local participant conversation with an identity of a local participant, e.g., based on correlating speech audio with a preregistered signature of the local participant, and/or based on correlating a physical location of the speech audio captured by a position-sensitive microphone (e.g., a microphone array) with a physical location of a speaker (e.g., as identified based on recognizing a physical location of an identified face). Accordingly, the transcript includes speech text based on transcribed speech audio captured when local participant 164 (Carol) spoke. “Board” may be used herein to refer to a whiteboard 190, or more generally to refer to any suitable medium for sharing visual information (e.g., analog multimedia content) with other local participants in conference environment 100, e.g., chalkboard, paper, computer display, overhead transparency display, and/or overhead camera display.
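As a non-limiting illustration of correlating the physical location of speech audio with the physical location of an identified face, the following minimal Python sketch compares a direction of arrival reported by a microphone array against bearings of recognized faces. The bearing representation in degrees, the function name `attribute_speech`, and the 15-degree tolerance are assumptions chosen for illustration and are not specified by the present disclosure.

```python
def attribute_speech(audio_bearing_deg, face_bearings_deg, max_error_deg=15.0):
    """Return the participant whose face bearing best matches the audio bearing.

    `face_bearings_deg` maps a participant name to the bearing (in degrees)
    of that participant's recognized face. Returns None when no face lies
    within `max_error_deg` of the audio bearing (hypothetical tolerance).
    """
    def angular_distance(a, b):
        # Shortest distance between two bearings on a 360-degree circle.
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    best_name, best_error = None, max_error_deg
    for name, bearing in face_bearings_deg.items():
        error = angular_distance(audio_bearing_deg, bearing)
        if error <= best_error:
            best_name, best_error = name, error
    return best_name

# Illustrative bearings; the values are assumptions.
faces = {"Carol": 355.0, "Beatrice": 90.0}
print(attribute_speech(5.0, faces))
print(attribute_speech(200.0, faces))
```

When no face is close enough to the audio bearing, the sketch leaves the speech unattributed, which is consistent with a reviewer later providing identification for participants who were not identified.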

In some examples, transcribed speech and/or speaker identity information may be gathered by computerized intelligent assistant 1300 in real time, in order to build the transcript in real time, and/or in order to provide notifications to conference participants about the transcribed speech in real time. In some examples, computerized intelligent assistant 1300 may be configured, for a stream of speech audio captured by a microphone, to identify a current speaker and to analyze the speech audio in order to transcribe speech text, substantially in parallel and/or in real time, so that speaker identity and transcribed speech text may be independently available. Accordingly, computerized intelligent assistant 1300 may be able to provide notifications to the conference participants in real time (e.g., for display at companion devices) indicating that another conference participant is currently speaking and including transcribed speech of the other conference participant, even before the other conference participant has finished speaking. Similarly, computerized intelligent assistant 1300 may be able to provide notifications to the conference participants including transcribed speech of another conference participant, even before the other conference participant has been identified and even before the other conference participant has finished speaking.

Computerized intelligent assistant 1300 may be able to capture images of shared visual information (e.g., from whiteboard 190). Returning briefly to FIG. 14, creating the transcript at 211 may further include tracking shared visual information at 215. Tracking shared visual information may include detecting a change to a board or other location at which visual information is being shared, e.g., by detecting new visual content such as a new diagram added to the board. Tracking the shared visual information may include correlating each change to the board with a timestamp. Tracking the shared visual information may include enhancing and/or correcting a captured image of the board. Enhancing and/or correcting the captured image may include geometric corrections (e.g., to correct a skew introduced by a point of view of a camera of computerized intelligent assistant 1300 relative to whiteboard 190), correcting a sharpness, brightness, and/or contrast of the board (e.g., by quantizing colors detected by a camera of computerized intelligent assistant 1300 to a limited number of colors corresponding to the number of different ink colors used to write on whiteboard 190), and/or performing optical character recognition to identify text and/or symbols drawn on the board. Accordingly, the enhanced and/or corrected image is a graphical depiction of new visual content that was added to the board, which may be saved in the transcript in association with a timestamp indicating when the new visual content was added. The transcription machine is configured to recognize visual information being shared by conference participants (e.g., in digital video captured by a camera of computerized intelligent assistant 1300 or in digital video captured by a companion device of a remote conference participant) and to include a digital image representing the visual information in the transcript. The transcription machine is further configured to recognize changes to the shared visual information, and accordingly to include difference images showing the changes to the visual information in the transcript, along with an indication of a time at which the visual information was changed (e.g., a timestamp for the change).
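As a non-limiting illustration of quantizing captured colors to a limited palette and of producing a difference image, the following minimal Python sketch operates on small images represented as nested lists of (r, g, b) tuples. The palette, the nearest-color rule (squared Euclidean distance in RGB), and the function names are assumptions for illustration; an actual implementation may use any suitable image processing technique.

```python
WHITE, BLACK, RED = (255, 255, 255), (0, 0, 0), (255, 0, 0)
PALETTE = [WHITE, BLACK, RED]  # hypothetical: board background plus two ink colors

def quantize(pixel, palette):
    """Map an (r, g, b) pixel to the nearest palette color."""
    return min(palette, key=lambda c: sum((p - q) ** 2 for p, q in zip(pixel, c)))

def quantize_image(image, palette):
    """Quantize every pixel of an image given as rows of (r, g, b) tuples."""
    return [[quantize(px, palette) for px in row] for row in image]

def diff_image(before, after, background):
    """Keep only pixels that changed between two same-sized images.

    Unchanged pixels are replaced with `background`, so the result
    depicts just the content added or altered since `before`.
    """
    return [
        [new if new != old else background for old, new in zip(row_b, row_a)]
        for row_b, row_a in zip(before, after)
    ]

# Two noisy captures of a one-row board, taken at different times.
capture_1 = [[(250, 250, 248), (20, 18, 25)]]
capture_2 = [[(240, 30, 35), (22, 20, 21)]]

board_1 = quantize_image(capture_1, PALETTE)
board_2 = quantize_image(capture_2, PALETTE)
change = diff_image(board_1, board_2, WHITE)
print(change)
```

Quantizing before differencing keeps small sensor noise (such as the slightly different dark pixels above) from being reported as a change to the board.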

Accordingly, as depicted in expanded view 180, a GUI for remote participation may include one or more depictions of whiteboard 190 at various times throughout the conference. For example, expanded view 180 includes two depictions of whiteboard 190 as local participant 164 (Carol) adds content to whiteboard 190, namely a first depiction 184 and a second depiction 185 where further content has been added. In some examples, depictions of whiteboard 190 (e.g., first depiction 184 and second depiction 185) may be useable to navigate throughout the transcript, e.g., remote participant 162 (Roger) may be able to select first depiction 184 to navigate to a time in the transcript correlated with a timestamp indicating when the content shown in first depiction 184 was added to the board. Similarly, a remote participant may be able to select a time in the transcript (e.g., using scrollbar 182) and accordingly, the GUI may show a limited number of depictions of a board, e.g., a board at a previous moment correlated to the time in the transcript in conjunction with a previous and subsequent depiction of the board to provide context. Alternately or additionally, the GUI for remote participation may include a live video depiction of whiteboard 190 showing whiteboard 190 as content is added to it, in real time. In some examples, depictions of a board may be processed to remove occlusions, by depicting the board at moments when it was not occluded and/or by interpolating board content in occluded regions based on the board content at previous moments when such regions were not occluded. For example, as depicted in FIG. 16, local participant 164 (Carol) may temporarily occlude whiteboard 190 while adding content to it, but first depiction 184 and second depiction 185 show only the content of the board. Accordingly, the transcription machine is configured to recognize an occlusion of the shared visual information on the board, and to process previously-saved images of the board (e.g., difference images showing changes to the visual information on the board) to create a processed image showing the visual information with the occlusion removed, in order to include the processed image in the transcript.
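As a non-limiting illustration of removing an occlusion based on board content at previous moments, the following minimal Python sketch fills each occluded cell of the latest board image with the most recent value observed while that cell was not occluded. The grid representation, the boolean occlusion masks (assumed to be produced elsewhere, e.g., by detecting a person in front of the board), and the function name `remove_occlusion` are assumptions for illustration.

```python
def remove_occlusion(frames, masks):
    """Compose a board image with occlusions removed.

    `frames` is a list of same-sized 2D grids ordered oldest to newest.
    `masks` has the same shape; True marks a cell occluded in that frame.
    Each output cell holds the most recent unoccluded observation,
    or None if the cell was occluded in every frame.
    """
    rows, cols = len(frames[0]), len(frames[0][0])
    result = [[None] * cols for _ in range(rows)]
    for frame, mask in zip(frames, masks):
        for r in range(rows):
            for c in range(cols):
                if not mask[r][c]:
                    result[r][c] = frame[r][c]
    return result

# The first cell is occluded in the newest frame, so its content
# is recovered from the earlier frame.
frames = [[["A", "B"]], [["?", "B2"]]]
masks = [[[False, False]], [[True, False]]]
print(remove_occlusion(frames, masks))
```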

Returning briefly to FIG. 14, creating the transcript at 211 further includes recognizing content of interest in the transcript at 216. Such content of interest may include any suitable content (e.g., pre-registered content of interest for a conference participant, as described above). In some examples, recognizing content of interest may include recognizing a name mention of a participant at 217. For example, as shown in FIG. 16, when local participant 164 (Carol) mentions local participant 165 (Dan)'s name, such mention may be recognized as an event “E2” of potential interest to local participant 165 (Dan) and/or to other conference participants. In some examples, computerized intelligent assistant 1300 may send a notification to one or more conference participants based on recognition of content of interest, e.g., computerized intelligent assistant 1300 may send a notification message to local user device 175 of local participant 165 (Dan), e.g., to remind Dan of the conference in case he has forgotten about it or in case he failed to receive/acknowledge an invitation. Accordingly, after receiving the notification message at local user device 175, local participant 165 (Dan) arrives at the conference and apologizes for his tardiness. In some examples, recognized name mentions may be shown (e.g., via transcript and/or notifications) to all conference participants. In other examples, a recognized name mention may be shown only to a subset of conference participants, e.g., only to remote participants, only to the conference leader, or only to conference participants whose own name was recognized. In some examples, the transcript may include an indication of a portion of the transcript related to content of interest for a conference participant, e.g., a timestamp indicating when the content of interest was discussed during the conference.

Returning briefly to FIG. 14, although not depicted in FIGS. 13 and 15-19, recognizing content of interest may further include recognizing an action item at 218. Recognizing an action item may be based on recognizing a commitment (e.g., when a conference participant promises to perform a task, or when a first conference participant requests that a second conference participant perform a task), or any other suitable detail arising in the conference which may indicate that one or more conference participants should follow up on a particular matter. Accordingly, such conference participants may receive a notification pertaining to the action item, and/or review events pertaining to the action item in a transcript.

FIG. 17 depicts conference environment 100 at another later time relative to FIGS. 15 and 16, after local participant 165 (Dan) has found his seat. As shown in the transcript entries 181, local participant 163 (Beatrice) has asked whether remote participant 162 (Roger) is on the bus. Accordingly, based on recognizing Roger's name being mentioned, transcript entries 181 and scrollbar 182 denote an event “E3” associated with Beatrice's question. Remote participant 162 (Roger) has accordingly filled out chat entry box 183 indicating that he is indeed on the bus.

FIG. 18 depicts conference environment 100 at another later time relative to FIGS. 15-17. Returning briefly to FIG. 14, creating the transcript further includes transcribing remote participant conversation at 219. Accordingly, as shown in expanded view 180 in FIG. 18, transcript entries 181 are updated to include remote participant 162 (Roger)'s response previously entered and sent via chat entry box 183.

In the present disclosure, remote participation is described in terms of remotely sending text messages via chat entry box 183, but remote participation may more generally include sending audiovisual data (e.g., voice/video call data) for listening/viewing by other (local and/or remote) conference participants, e.g., by outputting audio data at a speaker of computerized intelligent assistant 1300 and/or by displaying video data at a companion device. Similarly, although expanded view 180 of remote user device 172 depicts a text-based interface including a text transcript of the conference, a remote user device may alternately or additionally output audiovisual data (e.g., real-time speech audio and video of a local participant who is currently speaking).

Returning briefly to FIG. 14, creating the transcript at 211 may further include tracking shared digital information at 220. Such shared digital information may include any suitable digital content, e.g., word processor documents, presentation slides, multimedia files, computer programs, or any other files being reviewed by conference participants. For example, tracking shared digital information may include tracking a time at which one or more files were shared among conference participants. In some examples, tracking shared digital information may include tracking a time at which specific regions (e.g., pages or slides of a presentation, or a timestamp of a multimedia file) of a digital file were viewed, edited, or otherwise accessed by the conference participants. For example, such tracking may enable reviewing presentation slides alongside transcribed conversation. Accordingly, as shown in transcript entries 181 in FIG. 18, when Beatrice shares the “SALES_REPORT” file, such sharing is recorded in the transcript at an appropriate time and an event “E4” is generated, allowing navigation to when the file was shared. Although the foregoing example describes a digital file being shared by a local participant, digital files may be shared by any local or remote participant (e.g., by using a file submission interface of a GUI for remote participation, not depicted in FIGS. 13 and 15-19). When the transcript includes references to a shared digital content item, the transcript may additionally include a copy of the shared digital content item (e.g., as a digital file). Alternately or additionally, the transcript may include a descriptor of the shared digital content item useable to retrieve the digital content item (e.g., a uniform resource locator (URL)). Accordingly, transcript events that refer to the shared digital content item (e.g., event “E4”) may link to the digital content item or to a portion of the digital content item (e.g., event “E4” links to the “SALES_REPORT” file).

In some examples, shared digital information may be associated with a digital whiteboard. As digital content items are shared throughout the conference, the digital whiteboard may be updated to show content items that have been shared. Conference participants may additionally be able to add annotations to the digital whiteboard, where annotations may include any suitable content for display along with the shared content items, e.g., text, diagrams, and inking annotations more generally. The digital whiteboard may be configured to display each shared digital content item in a spatial location, e.g., so as to simulate arranging documents in a physical space. Accordingly, the annotations added to the digital whiteboard may indicate relationships between shared digital content items (e.g., by drawing an arrow from one digital content item to another). As with shared visual information and other details of the conference, whenever a digital content item or an annotation is shared to the digital whiteboard and/or whenever a digital content item on the digital whiteboard is modified, viewed, or mentioned in conversation, computerized intelligent assistant 1300 may add an event to the transcript describing the changes to the digital whiteboard and/or showing a snapshot of the digital whiteboard at a current time. In this manner, the digital whiteboard may be used to navigate the transcript and/or the transcript may be used to navigate changes to the digital whiteboard, as with shared visual information (e.g., similarly to how a real whiteboard may be tracked by computerized intelligent assistant 1300). In some examples, the transcription machine is configured to receive an indication of a digital file to be shared from a companion device of a conference participant, and accordingly, to include an indication that the digital file was shared in the transcript. In some examples, the transcription machine is configured to recognize when a portion of the file is being accessed by any conference participant, and accordingly, to include an indication of the portion of the file that was accessed and a time at which the digital file was accessed.
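As a non-limiting illustration of recording that a digital file was shared and that a portion of it was later accessed, the following minimal Python sketch stores such occurrences as timestamped transcript events carrying a retrieval descriptor. The class and field names, the event label “E7,” the timestamps, and the example URL are hypothetical and introduced only for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional

@dataclass
class FileEvent:
    label: str                     # event label shown in the transcript, e.g. "E4"
    timestamp: float               # seconds since the start of the conference
    participant: str
    file_name: str
    action: str                    # "shared" or "viewed"
    url: Optional[str] = None      # descriptor useable to retrieve the file
    portion: Optional[str] = None  # e.g. a page or slide that was accessed

class FileTracker:
    """Collects file-related transcript events (hypothetical helper)."""

    def __init__(self) -> None:
        self.events: List[FileEvent] = []
        self.urls: Dict[str, str] = {}

    def share(self, label, timestamp, participant, file_name, url):
        self.urls[file_name] = url
        self.events.append(
            FileEvent(label, timestamp, participant, file_name, "shared", url))

    def access(self, label, timestamp, participant, file_name, portion):
        # Reuse the URL recorded when the file was shared, if any.
        url = self.urls.get(file_name)
        self.events.append(
            FileEvent(label, timestamp, participant, file_name, "viewed", url, portion))

    def events_for(self, file_name):
        return [e for e in self.events if e.file_name == file_name]

tracker = FileTracker()
tracker.share("E4", 310.0, "Beatrice", "SALES_REPORT",
              "https://example.com/files/SALES_REPORT")  # placeholder URL
tracker.access("E7", 1250.0, "Beatrice", "SALES_REPORT", "page 3")
for event in tracker.events_for("SALES_REPORT"):
    print(event.label, event.action, event.timestamp, event.portion)
```

Because every event carries a timestamp, a GUI could use the same records both to link an event to the file and to navigate to the moment the file was shared or viewed.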

Transcript entries 181 further include conversation among local participants (e.g., local participant 163 (Beatrice) and local participant 164 (Carol)), including conversation in which Beatrice's name is recognized and an event “E5” is generated based on the name mention. Furthermore, as shown in FIG. 18, local participant 164 (Carol) has updated whiteboard 190. Accordingly, in addition to first depiction 184 and second depiction 185 of whiteboard 190, expanded view 180 includes a third depiction 186 of whiteboard 190.

FIG. 19 shows conference environment 100 at another later time relative to FIGS. 15-18. Although not depicted, at this later time, the conference participants have engaged in substantially more conversation, causing the transcript entries 181 shown in expanded view 180 to scroll past entries shown in previous figures. As shown in the transcript entries 181, local participant 164 (Carol) proposes looking at the “SALES_REPORT” file previously shared by local participant 163 (Beatrice). However, at this time in the conference, local participant 161 (Anna) needs to leave due to a prior commitment. Returning briefly to FIG. 14, creating the transcript at 211 includes tracking participant departures at 221.

Tracking local participant departures may include recognizing a participant in a fashion similar to that described above with regard to tracking participant arrivals, e.g., based on audiovisual data. Tracking participant departures may include, for a recognized local participant, tracking a physical location of the local participant (e.g., based on visual information captured at the camera or based on a companion device) and considering the participant to have departed after their physical location is a threshold distance outside of conference environment 100. Similarly, tracking participant departures may include, for a recognized local participant, recognizing that such local participant is no longer detectable within audiovisual data (even in absence of affirmative confirmation that such local participant has left conference environment 100). Similarly, tracking participant departures may include, for a recognized local participant, recognizing that such local participant is likely departing in advance of the local participant leaving conference environment 100. Such recognition may include tracking a trajectory of the physical location of the local participant (e.g., as the local participant walks towards an exit of conference environment 100). Such recognition may further include detecting an audiovisual cue indicating that the local participant is likely leaving, e.g., if the local participant says “goodbye” and waves to the other participants, and/or if the other participants say “goodbye” to the local participant. Multiple signals may be combined to detect participant departure, and such signals may be analyzed to determine a confidence of recognizing departure before the transcript is updated to indicate departure based on the confidence exceeding a predefined threshold; for example, if a local participant waves “goodbye,” computerized intelligent assistant 1300 may infer that the participant is likely leaving with a first confidence, and if the local participant subsequently packs a bag and moves towards the door, computerized intelligent assistant 1300 may infer that the local participant is likely leaving with a second, higher confidence that exceeds the predefined threshold, causing computerized intelligent assistant 1300 to infer that the local participant is indeed leaving.
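As a non-limiting illustration of combining multiple signals into a confidence that is compared against a predefined threshold, the following minimal Python sketch uses a "noisy-OR" combination, in which each signal independently contributes a probability that the participant is leaving. The combination rule, the per-signal confidence values, and the 0.8 threshold are assumptions chosen for illustration; the present disclosure does not prescribe a particular rule.

```python
DEPARTURE_THRESHOLD = 0.8  # hypothetical predefined threshold

def combined_confidence(signal_confidences):
    """Combine per-signal confidences with a noisy-OR rule.

    Returns 1 - product(1 - c) over all signals, so each additional
    signal can only raise the overall confidence.
    """
    remaining_doubt = 1.0
    for confidence in signal_confidences:
        remaining_doubt *= (1.0 - confidence)
    return 1.0 - remaining_doubt

def has_departed(signal_confidences, threshold=DEPARTURE_THRESHOLD):
    """True when the combined confidence exceeds the threshold."""
    return combined_confidence(signal_confidences) > threshold

# First confidence: the participant waves "goodbye" (assumed value 0.5).
wave_only = [0.5]
# Second, higher confidence: the participant also packs a bag (0.4)
# and moves towards the door (0.6).
wave_pack_and_walk = [0.5, 0.4, 0.6]

print(has_departed(wave_only))
print(has_departed(wave_pack_and_walk))
```

An inference based on a participant's schedule or context could be supplied to the same function as one more entry in the list of signal confidences.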

In some examples, computerized intelligent assistant 1300 may be configured to infer that a conference participant is likely leaving based on a schedule and/or context of the conference participant. In an example, computerized intelligent assistant 1300 may infer that a conference participant is leaving a first conference based on the participant being invited to a second, different conference occurring during and/or shortly after the first conference. In an example, computerized intelligent assistant 1300 may infer that a conference participant is leaving a conference based on the conference participant making a commitment during the conference, e.g., based on the conference participant announcing that they will begin a task immediately. Computerized intelligent assistant 1300 may combine the inference that a participant is likely leaving based on a schedule and/or context with other signals (e.g., waving “goodbye”) and accordingly may infer that the conference participant is leaving with a relatively higher confidence (e.g., as compared to an inference based only on the conference participant's schedule or as compared to an inference based only on signals observed by computerized intelligent assistant 1300 in the conference environment).

Similar to tracking remote participant arrivals, tracking remote participant departures may be based on a login and/or availability status of the remote participant, e.g., based on the remote participant exiting from a GUI for remote participation.

Although not depicted in FIGS. 13 and 15-19, a conference participant may briefly depart from a conference only to return later, before the end of the conference (e.g., to take a break or to attend to another matter); accordingly, recording the transcript throughout the conference may include tracking multiple arrivals and departures for each conference participant. In addition to tracking individual departures at 221, computerized intelligent assistant 1300 may be configured to track when all conference participants leave, when a threshold portion (e.g., 50%) of conference participants leave, and/or when the conference leader leaves. Various departure criteria may be used in order to automatically end the conference and/or cease creating the transcript. In some examples, computerized intelligent assistant 1300 may be configured to prompt conference participants who are still in attendance as to whether the meeting should stop or whether recording should continue, wherein such prompting may include any suitable notification (e.g., a speech audio question, or a prompt displayed at a GUI of a companion device).
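As a non-limiting illustration of tracking multiple arrivals and departures and of evaluating departure criteria, the following minimal Python sketch keeps a log of attendance changes and reports which criteria are currently met. The class name `AttendanceTracker`, the criterion labels, and the choice to measure the threshold portion against everyone who has attended so far are assumptions for illustration.

```python
class AttendanceTracker:
    """Tracks arrivals/departures and evaluates end-of-conference criteria."""

    def __init__(self, leader, threshold_portion=0.5):
        self.leader = leader
        self.threshold_portion = threshold_portion
        self.ever_attended = set()
        self.present = set()
        self.log = []  # (timestamp, participant, "arrived" | "departed")

    def arrive(self, participant, timestamp):
        self.ever_attended.add(participant)
        self.present.add(participant)
        self.log.append((timestamp, participant, "arrived"))

    def depart(self, participant, timestamp):
        self.present.discard(participant)
        self.log.append((timestamp, participant, "departed"))

    def end_reasons(self):
        """Return the departure criteria that are currently satisfied."""
        reasons = []
        if not self.ever_attended:
            return reasons
        departed = self.ever_attended - self.present
        if not self.present:
            reasons.append("all_left")
        if len(departed) / len(self.ever_attended) >= self.threshold_portion:
            reasons.append("threshold_left")
        if self.leader in departed:
            reasons.append("leader_left")
        return reasons

tracker = AttendanceTracker(leader="Carol")
for t, name in enumerate(["Carol", "Beatrice", "Anna", "Dan"]):
    tracker.arrive(name, float(t))
tracker.depart("Anna", 100.0)   # one of four has left
print(tracker.end_reasons())
tracker.depart("Dan", 120.0)    # half have left
print(tracker.end_reasons())
```

A non-empty result need not end the conference outright; it could instead trigger a prompt asking the remaining participants whether recording should continue.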

Returning to FIG. 19, after local participant 161 (Anna) leaves the conference room, transcript entries 181 include an indication that Anna has left.

Transcript entries 181 further indicate that local participant 163 (Beatrice) is viewing a specific page of the previously-shared “SALES_REPORT” file, e.g., on local user device 173.

Returning briefly to FIG. 14, creating the transcript at 211 may further include tracking a tag submitted via a companion device at 222. For example, a companion device may be configured to interpret a specific GUI input, gesture, audio command, or any other suitable input as an indication that a new event should be added to the transcript at a current timestamp. Such tag may indicate a time of interest (e.g., a bookmark) in the transcript or any other suitable event. In some examples, companion devices may be configured to recognize multiple different tags, each tag corresponding to a different event to record in the transcript. Accordingly, transcript entries 181 shown in expanded view 180 of FIG. 19 include an indication that remote participant 162 (Roger) added a bookmark to the timeline. As with other events, tags added via companion devices may indicate events of interest to a particular user and/or to all users.

Returning briefly to FIG. 14, creating the transcript at 211 may further include tracking participant gestures at 223 (e.g., hand gestures). Such gestures may include previously-defined gestures which may indicate an event and/or control behavior of computerized intelligent assistant 1300. Accordingly, recognizing gestures may be performed by a gesture recognition machine configured to recognize one or more gestures. The gesture recognition machine may be implemented via any suitable combination of ML and/or AI technologies, e.g., using a neural network trained for gesture recognition.

In an example, a hand gesture is an “off-the-record” gesture indicating that recording and/or automatically creating the transcript should be stopped. Accordingly, computerized intelligent assistant 1300 may, at least temporarily, stop automatically creating the transcript responsive to recognizing the “off-the-record” hand gesture (e.g., by the gesture recognition machine). After recognizing such gesture, computerized intelligent assistant 1300 may be configured to stop recording until a different “on-the-record” gesture and/or voice command is received. When going “off-the-record,” computerized intelligent assistant 1300 may be configured to provide a notification (e.g., an acknowledgement signal such as a light turning from green to red) to local participants. When going “off-the-record,” computerized intelligent assistant 1300 may be configured to notify remote participants (e.g., by providing a notification message at a companion device). In some examples, a transcript being viewed by local and/or remote participants may temporarily include “off-the-record” events (e.g., so that remote participants remain apprised of the situation) and such “off-the-record” events may be deleted from backend server 1320, computerized intelligent assistant 1300, and companion devices at a later time. Such later time could be the end of the conference, when going back on the record, or any other suitable later time (e.g., after 24 hours). Alternately, computerized intelligent assistant 1300 may be configured to omit “off-the-record” events from the transcript entirely. When “off-the-record” events are omitted from the transcript, computerized intelligent assistant 1300 may provide conference participants with an indication that “off-the-record” events may be occurring. Alternately, computerized intelligent assistant 1300 may not inform conference participants that the conference is currently “off-the-record,” or may only inform a subset of conference participants (e.g., a conference leader, only remote participants, or only a previously designated subset of participants) that the conference is currently “off-the-record.” In examples where a companion device of a remote participant is configured to output audio/video of the conference, when the conference is “off-the-record,” the conference audio/video optionally may be muted/not displayed to unauthorized remote participants (e.g., computerized intelligent assistant 1300 may not send conference audio/video to the companion device of the remote participant when the conference is “off-the-record”). In some implementations, authorization for off-the-record portions of a conference may be set based on user credentials/privileges, and in some implementations authorization may be dynamically set based on conference participant directive.
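As a non-limiting illustration of the variant in which “off-the-record” events are temporarily included in the transcript and deleted at a later time, the following minimal Python sketch flags entries captured while off the record and removes them on request. The class name `Recorder`, the gesture identifiers, and the authorization flag are hypothetical; gesture recognition itself is assumed to be performed elsewhere by the gesture recognition machine.

```python
class Recorder:
    """Minimal on/off-the-record transcript recorder (illustrative only)."""

    def __init__(self):
        self.on_record = True
        self.entries = []  # list of (text, is_off_record)

    def handle_gesture(self, gesture):
        # Gesture identifiers are assumptions for this sketch.
        if gesture == "off_the_record":
            self.on_record = False
        elif gesture == "on_the_record":
            self.on_record = True

    def add(self, text):
        # Entries captured while off the record are kept but flagged.
        self.entries.append((text, not self.on_record))

    def visible_entries(self, authorized):
        """Entries a viewer may see; unauthorized viewers see only on-record text."""
        return [text for text, off in self.entries if authorized or not off]

    def purge_off_record(self):
        """Delete flagged entries, e.g. at the end of the conference."""
        self.entries = [(text, off) for text, off in self.entries if not off]

recorder = Recorder()
recorder.add("Carol: Let's review the numbers.")
recorder.handle_gesture("off_the_record")
recorder.add("Carol: Between us, the launch may slip.")
recorder.handle_gesture("on_the_record")
recorder.add("Beatrice: Back to the agenda.")

print(recorder.visible_entries(authorized=False))
recorder.purge_off_record()
print(recorder.visible_entries(authorized=True))
```

The alternative of omitting “off-the-record” events entirely would correspond to skipping the append in `add` whenever `on_record` is false.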

In an example, computerized intelligent assistant 1300 is configured to recognize a hand gesture to indicate a request and/or action item, so as to add an event to the transcript. In some examples, computerized intelligent assistant 1300 may be configured to recognize a plurality of different predefined gestures to indicate different kinds of events (e.g., similar to different kinds of tags submitted via companion devices, as described above). For example, the gesture recognition machine may recognize a gesture indicating that an event of interest occurred, and accordingly, responsive to detection of the gesture by the gesture recognition machine, the transcription machine may include in the transcript an indication that the event of interest occurred.

In an example, computerized intelligent assistant 1300 is configured to recognize a hand gesture in order to mediate conversation between local and/or remote participants. For example, computerized intelligent assistant 1300 may be configured to recognize a raised hand as a gesture indicating that a local participant wishes to interject, and accordingly, responsive to recognizing a raised hand gesture, may facilitate interjection by alerting other participants and/or adjusting recording.

Creating the transcript at 211 may include recognizing a sentiment at 224. For example, recognizing such sentiment may include operating a machine learning classifier previously trained to classify words and/or phrases as positive, negative, and/or associated with a specific sentiment (e.g., “happy” or “angry”). In some examples, the machine learning classifier may be configured to receive raw audio and/or video data and to recognize sentiment based on the raw audio data (e.g., based on tone of voice) and/or based on the raw video data (e.g., based on facial expressions and/or body language). Alternately or additionally, the machine learning classifier may be configured to receive any other suitable transcript data automatically recorded at 211, e.g., transcribed speech audio in the form of text. The transcription machine may be configured to analyze the transcript to detect words having a predefined sentiment (e.g., positive, negative, “happy”, or any other suitable sentiment), in order to present a sentiment analysis summary at a companion device of a conference participant, indicating a frequency of utterance of words having the predefined sentiment.
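As a non-limiting illustration of detecting words having a predefined sentiment and counting their frequency of utterance, the following minimal Python sketch uses a small fixed lexicon as a stand-in for the previously trained machine learning classifier. The lexicon contents, the function name `sentiment_word_counts`, and the sample entries are assumptions for illustration only.

```python
from collections import Counter
import re

# Hypothetical lexicon standing in for a trained classifier.
SENTIMENT_LEXICON = {
    "thanks": "positive",
    "great": "positive",
    "happy": "happy",
    "sorry": "negative",
    "late": "negative",
}

def sentiment_word_counts(entries):
    """Count how often each sentiment-related word was uttered.

    `entries` is an iterable of (speaker, text) pairs taken from
    the transcript.
    """
    counts = Counter()
    for _speaker, text in entries:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in SENTIMENT_LEXICON:
                counts[word] += 1
    return counts

entries = [
    ("Dan", "Sorry I am late."),
    ("Carol", "Thanks, Dan."),
    ("Beatrice", "Thanks, that looks great."),
]
counts = sentiment_word_counts(entries)
print(counts.most_common())
```

The resulting counts could drive a summary in which each word is drawn at a size proportional to its frequency.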

Creating the transcript at 211 may include recognizing non-verbal cues at 225. For example, such non-verbal cues may include laughter, raised voices, long pauses/silences, applause, interruptions, and any other features of the timing and/or delivery of conversational content that may arise during natural conversation.

Although FIGS. 13 and 15-19 depict non-limiting examples of events that are recognized/tracked by computerized intelligent assistant 1300 to generate real-time notifications and/or to add to a reviewable transcript, any other events or details of conferences described herein may be recognized/tracked by computerized intelligent assistant 1300 for the purpose of generating notifications and recording the reviewable transcript.

Returning to FIG. 14, method 200 further includes providing participant feedback at 261. Such feedback may be based on any suitable details observed throughout the conference at 211 (e.g., based on analyzing a conference transcript or based on analyzing details as they are observed throughout the conference). In some examples, such feedback may be based on sentiment recognized at 224 and non-verbal cues recognized at 225.

FIG. 20 shows a non-limiting example of participant feedback 2000 based on details observed during a conference. For example, participant feedback 2000 may be displayed at a companion device, saved to backend server 1320, or otherwise made available to the conference participants or others. Participant feedback 2000 may be used for self-coaching, e.g., to improve the conference experience and to help conference participants learn how to work more effectively with each other.

Participant feedback 2000 includes a sentiment analysis summary 2001, including a “word cloud” of sentiment-related words that occurred in the transcript, visually depicted with a size indicating their frequency of utterance (e.g., “thanks” was the most frequent sentiment-related word observed during the conference).
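
As a non-limiting illustration, the frequency counting that underlies such a word cloud may be sketched as follows. The fixed lexicon `SENTIMENT_WORDS`, the `(speaker, text)` entry format, and the linear size scaling are assumptions made for this sketch only; the disclosure contemplates a trained machine learning classifier rather than a fixed word list.

```python
import re
from collections import Counter

# Hypothetical lexicon of sentiment-related words (an assumption for this
# sketch; a trained classifier could be used instead).
SENTIMENT_WORDS = {"thanks", "great", "happy", "angry", "sorry"}


def sentiment_word_frequencies(entries):
    """Count sentiment-related words across (speaker, text) transcript entries."""
    counts = Counter()
    for _speaker, text in entries:
        for word in re.findall(r"[a-z']+", text.lower()):
            if word in SENTIMENT_WORDS:
                counts[word] += 1
    return counts


def font_sizes(counts, min_pt=10, max_pt=48):
    """Scale each word's display size linearly with its frequency of utterance."""
    if not counts:
        return {}
    top = max(counts.values())
    return {w: min_pt + (max_pt - min_pt) * c / top for w, c in counts.items()}
```

Passing only the entries attributed to one speaker to `sentiment_word_frequencies` would yield a word cloud specific to that conference participant.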

Participant feedback 2000 further includes an overall mood summary 2002 indicating which conference participants expressed various overall moods. For example, overall mood summary 2002 may be based on a frequency of utterance of sentiment-related words corresponding to different sentiments, e.g., an average sentiment. As depicted, Anna, Carol, Robert, and Roger expressed positive sentiment on average, whereas Beatrice expressed happy sentiment on average and Dan expressed negative sentiment on average.
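
One possible way to derive an average sentiment per participant is sketched below. The numeric word scores in `WORD_SCORES`, the three mood labels, and the function name `overall_mood` are hypothetical and are used only to illustrate the averaging step.

```python
from collections import defaultdict

# Hypothetical per-word scores: +1 for positive words, -1 for negative words.
WORD_SCORES = {"thanks": 1, "great": 1, "good": 1, "bad": -1, "terrible": -1}


def overall_mood(entries):
    """Map each speaker to a mood label based on the average word score."""
    totals = defaultdict(lambda: [0, 0])  # speaker -> [score sum, word count]
    for speaker, text in entries:
        for raw in text.lower().split():
            word = raw.strip(".,!?")
            if word in WORD_SCORES:
                totals[speaker][0] += WORD_SCORES[word]
                totals[speaker][1] += 1
    moods = {}
    for speaker, (score, n) in totals.items():
        avg = score / n  # n >= 1 because entries are created only on a match
        if avg > 0:
            moods[speaker] = "positive"
        elif avg < 0:
            moods[speaker] = "negative"
        else:
            moods[speaker] = "neutral"
    return moods
```

In this sketch, speakers who uttered no scored words are omitted from the result rather than being labeled neutral.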

Participant feedback 2000 further includes a participation summary 2003 indicating when different conference participants spoke or otherwise participated during the conference (e.g., as a histogram with the X-axis indicating periods of time in the conference and the Y-axis indicating frequency of participation during each period of time). Alternately or additionally, participant feedback may indicate whether each conference participant was present during the conference (e.g., by visually presenting an icon for each participant with a visual indicator such as a check mark for each participant who was present, by visually presenting a list of participants who were present and a list of participants who were not present, or by indicating presence and/or absence of conference participants in any other suitable manner).
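
The histogram described above may, for example, be computed by binning the start times of attributed utterances. The following is a minimal sketch; the five-minute bin width and the `(speaker, start_time_s)` input format are assumptions chosen for illustration.

```python
import math


def participation_histogram(utterances, conference_length_s, bin_s=300):
    """Return {speaker: [utterance count per time bin]}.

    utterances: iterable of (speaker, start_time_s) pairs, with times measured
    in seconds from the start of the conference.
    """
    n_bins = max(1, math.ceil(conference_length_s / bin_s))
    hist = {}
    for speaker, t in utterances:
        # Clamp so an utterance at the very end falls in the last bin.
        idx = min(int(t // bin_s), n_bins - 1)
        hist.setdefault(speaker, [0] * n_bins)[idx] += 1
    return hist
```

Each list index corresponds to one period on the X-axis, and each count to the frequency of participation on the Y-axis.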

Although FIG. 20 depicts participant feedback pertaining to all conference participants, alternately or additionally, participant feedback may be specific to one conference participant, e.g., a sentiment analysis word cloud showing only the sentiment words uttered by the conference participant.

Although not depicted in FIG. 20, participant feedback may further include advice based on analysis of sentiment and/or non-verbal cues recognized by computerized intelligent assistant 1300. For example, if Anna frequently interrupted Beatrice, such advice may direct Anna to be careful about interrupting others, along with an indication of times in the transcript where Anna interrupted Beatrice. Accordingly, Anna may be able to review the transcript to become more cognizant about when she may be interrupting others. Participant feedback may further include exemplary interactions where an utterance provoked a specific reaction, e.g., if Carol said something that caused Dan to express negative sentiment, participant feedback given to Carol may indicate what she said to cause Dan to express the negative sentiment. Similarly, feedback may indicate exemplary interactions that led to various other sentiments, as well as non-verbal cues like raised voices, interruptions, pauses/silences, and conference participants departing the conference. In some examples, such exemplary interactions may be useable by computerized intelligent assistant 1300 to identify causes of conflict during the conference, and/or to identify portions of the conference that went particularly well.

Participant feedback may also include feedback regarding the timing and/or logistics of the conference. For example, such feedback could include calling attention to whether the conference started and/or ended on schedule, along with an indication of which conference participants showed up early, showed up late, and/or left early.
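
By way of illustration only, arrival and departure times recorded in the transcript may be compared against the scheduled times to flag such participants. The `attendance` mapping, the 60-second tolerance, and the flag strings below are assumptions introduced for this sketch.

```python
def attendance_flags(scheduled_start, scheduled_end, attendance, tolerance_s=60):
    """Flag participants who arrived early, arrived late, and/or left early.

    attendance: {name: (arrival_time_s, departure_time_s)}; all times share
    the same clock as the scheduled start and end times.
    """
    flags = {}
    for name, (arrival, departure) in attendance.items():
        found = []
        if arrival < scheduled_start - tolerance_s:
            found.append("early")
        elif arrival > scheduled_start + tolerance_s:
            found.append("late")
        if departure < scheduled_end - tolerance_s:
            found.append("left early")
        flags[name] = found
    return flags
```

The tolerance avoids flagging participants who arrive or leave within about a minute of the scheduled times.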

Participant feedback may be generated for each individual conference participant and/or for participants of a particular conference. Alternately or additionally, participant feedback may be aggregated for all conferences held by an organization and/or a team within an organization. For example, such participant feedback may provide cumulative statistics regarding individual participant and/or organizational behaviors, e.g., by measuring a percentage of meetings that start on time, a percentage of meeting participants who remained silent throughout a whole meeting, or any other suitable statistics and/or analysis of details captured in transcripts.
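
The two cumulative statistics named above may be sketched as follows. The `MeetingRecord` structure, its field names, and the on-time tolerance are hypothetical; they stand in for whatever details are actually captured in the transcripts.

```python
from dataclasses import dataclass


@dataclass
class MeetingRecord:
    """Hypothetical per-meeting summary extracted from a transcript."""
    scheduled_start: float  # seconds
    actual_start: float     # seconds
    present: frozenset      # participants who attended
    speakers: frozenset     # participants with at least one attributed utterance


def on_time_percentage(meetings, tolerance_s=60):
    """Percentage of meetings that started within the tolerance of schedule."""
    if not meetings:
        return 0.0
    on_time = sum(
        1 for m in meetings if m.actual_start - m.scheduled_start <= tolerance_s
    )
    return 100.0 * on_time / len(meetings)


def silent_participant_percentage(meetings):
    """Percentage of attendances in which the participant never spoke."""
    attended = sum(len(m.present) for m in meetings)
    if attended == 0:
        return 0.0
    silent = sum(len(m.present - m.speakers) for m in meetings)
    return 100.0 * silent / attended
```

Filtering the list of records by team or by participant before calling these functions would give the corresponding team-level or individual statistics.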

In some examples, computerized intelligent assistant 1300 may include a participant feedback machine configured to automatically analyze the transcript, to communicatively couple to a companion device of a conference participant, and, based on the analysis of the transcript, to provide feedback regarding the conference to the conference participant (e.g., participant feedback 2000). In some examples, the feedback regarding the conference includes one or more of a notification message sent to the companion device and a reviewable transcript displayable at the companion device (e.g., reviewable transcript entries 181 as shown in FIGS. 13 and 15-19).

Computerized intelligent assistant 1300 may assist users in a conference environment even when no conference is scheduled or in progress in the conference environment. For example, computerized intelligent assistant 1300 may be aware of other scheduled conferences (e.g., in different conference environments or at a different time in the same conference environment). Computerized intelligent assistant 1300 may cooperate with backend server 1320 and/or with other, different computerized intelligent assistants to maintain a shared schedule and/or location mapping (e.g., floor map) of conferences within an organization or across multiple organizations. For example, FIG. 21 depicts the conference environment 100 shown in FIGS. 13 and 15-19, at a later time when no conference is being held. An individual 167 (Eric) shows up at conference environment 100 to find an empty conference room. Accordingly, computerized intelligent assistant 1300 may output speech audio, informing the individual 167 (Eric) that his meeting is in a different room. In some examples, computerized intelligent assistant 1300 may use a location mapping to give detailed directions (e.g., conference room is upstairs and to the left). In some examples, computerized intelligent assistant 1300 may recognize that an individual has arrived substantially early for (e.g., an hour early, or the day before) a conference and inform them when the conference will occur. In some examples, computerized intelligent assistant 1300 may recognize that an individual has missed a meeting, and accordingly may inform the individual when the meeting occurred, and/or provide a reviewable transcript of the meeting.

The methods and processes described herein may be tied to a computing system of one or more computing devices. In particular, such methods and processes may be implemented as an executable computer-application program, a network-accessible computing service, an application-programming interface (API), a library, or a combination of the above and/or other compute resources.

FIG. 22 schematically shows a simplified representation of a computing system 1300 configured to provide any to all of the compute functionality described herein. Computing system 1300 may take the form of one or more personal computers, network-accessible server computers, tablet computers, home-entertainment computers, gaming devices, mobile computing devices, mobile communication devices (e.g., smart phone), virtual/augmented/mixed reality computing devices, wearable computing devices, Internet of Things (IoT) devices, embedded computing devices, and/or other computing devices. For example, computing system 1300 may be a computerized intelligent assistant 1300.

Computing system 1300 includes a logic subsystem 1002 and a storage subsystem 1004. Computing system 1300 further includes a camera 1012 and a microphone 1014. Computing system 1300 may optionally include a display subsystem 1008, input/output subsystem 1010, communication subsystem 1012, and/or other subsystems not shown in FIG. 22.

Logic subsystem 1002 includes one or more physical devices configured to execute instructions. For example, the logic subsystem may be configured to execute instructions that are part of one or more applications, services, or other logical constructs. The logic subsystem may include one or more hardware processors configured to execute software instructions. Additionally or alternatively, the logic subsystem may include one or more hardware or firmware devices configured to execute hardware or firmware instructions. Processors of the logic subsystem may be single-core or multi-core, and the instructions executed thereon may be configured for sequential, parallel, and/or distributed processing. Individual components of the logic subsystem optionally may be distributed among two or more separate devices, which may be remotely located and/or configured for coordinated processing. Aspects of the logic subsystem may be virtualized and executed by remotely-accessible, networked computing devices configured in a cloud-computing configuration.

Storage subsystem 1004 includes one or more physical devices configured to temporarily and/or permanently hold computer information such as data and instructions executable by the logic subsystem. When the storage subsystem includes two or more devices, the devices may be collocated and/or remotely located. Storage subsystem 1004 may include volatile, nonvolatile, dynamic, static, read/write, read-only, random-access, sequential-access, location-addressable, file-addressable, and/or content-addressable devices. Storage subsystem 1004 may include removable and/or built-in devices. When the logic subsystem executes instructions, the state of storage subsystem 1004 may be transformed, e.g., to hold different data.

Aspects of logic subsystem 1002 and storage subsystem 1004 may be integrated together into one or more hardware-logic components. Such hardware-logic components may include program- and application-specific integrated circuits (PASIC/ASICs), program- and application-specific standard products (PSSP/ASSPs), system-on-a-chip (SOC), and complex programmable logic devices (CPLDs), for example.

The logic subsystem and the storage subsystem may cooperate to instantiate one or more logic machines. For example, logic subsystem 1002 and storage subsystem 1004 of computing system 1300 are configured to instantiate a face identification machine 1020, a speech recognition machine 1022, an attribution machine 1024, a transcription machine 1026, and a gesture recognition machine 1028. As used herein, the term “machine” is used to collectively refer to hardware and any software, instructions, and/or other components cooperating with such hardware to provide computer functionality. In other words, “machines” are never abstract ideas and always have a tangible form. A machine may be instantiated by a single computing device, or a machine may include two or more sub-components instantiated by two or more different computing devices. In some implementations a machine includes a local component (e.g., software application) cooperating with a remote component (e.g., cloud computing service). The software and/or other instructions that give a particular machine its functionality may optionally be saved as an unexecuted module on a suitable storage device.

Machines may be implemented using any suitable combination of state-of-the-art and/or future machine learning (ML), artificial intelligence (AI), and/or natural language processing (NLP) techniques. Non-limiting examples of techniques that may be incorporated in an implementation of one or more machines include support vector machines, multi-layer neural networks, convolutional neural networks (e.g., including spatial convolutional networks for processing images and/or videos, temporal convolutional neural networks for processing audio signals and/or natural language sentences, and/or any other suitable convolutional neural networks configured to convolve and pool features across one or more temporal and/or spatial dimensions), recurrent neural networks (e.g., long short-term memory networks), associative memories (e.g., lookup tables, hash tables, Bloom Filters, Neural Turing Machine and/or Neural Random Access Memory), word embedding models (e.g., GloVe or Word2Vec), unsupervised spatial and/or clustering methods (e.g., nearest neighbor algorithms, topological data analysis, and/or k-means clustering), graphical models (e.g., Markov models, conditional random fields, and/or AI knowledge bases), and/or natural language processing techniques (e.g., tokenization, stemming, constituency and/or dependency parsing, and/or intent recognition).

In some examples, the methods and processes described herein may be implemented using one or more differentiable functions, wherein a gradient of the differentiable functions may be calculated and/or estimated with regard to inputs and/or outputs of the differentiable functions (e.g., with regard to training data, and/or with regard to an objective function). Such methods and processes may be at least partially determined by a set of trainable parameters. Accordingly, the trainable parameters for a particular method or process may be adjusted through any suitable training procedure, in order to continually improve functioning of the method or process.

Non-limiting examples of training procedures for adjusting trainable parameters include supervised training (e.g., using gradient descent or any other suitable optimization method), zero-shot, few-shot, unsupervised learning methods (e.g., classification based on classes derived from unsupervised clustering methods), reinforcement learning (e.g., deep Q learning based on feedback) and/or generative adversarial neural network training methods. In some examples, a plurality of methods, processes, and/or components of systems described herein may be trained simultaneously with regard to an objective function measuring performance of collective functioning of the plurality of components (e.g., with regard to reinforcement feedback and/or with regard to labelled training data). Simultaneously training the plurality of methods, processes, and/or components may improve such collective functioning. In some examples, one or more methods, processes, and/or components may be trained independently of other components (e.g., offline training on historical data).

When included, display subsystem 1008 may be used to present a visual representation of data held by storage subsystem 1004. This visual representation may take the form of a graphical user interface (GUI). Display subsystem 1008 may include one or more display devices utilizing virtually any type of technology. In some implementations, display subsystem may include one or more virtual-, augmented-, or mixed reality displays.

When included, input subsystem 1010 may comprise or interface with one or more input devices. An input device may include a sensor device or a user input device. Examples of user input devices include a keyboard, mouse, touch screen, or game controller. In some embodiments, the input subsystem may comprise or interface with selected natural user input (NUI) componentry. Such componentry may be integrated or peripheral, and the transduction and/or processing of input actions may be handled on- or off-board. Example NUI componentry may include one or more microphones (e.g., a microphone, stereo microphone, position-sensitive microphone and/or microphone array) for speech and/or voice recognition; an infrared, color, stereoscopic, and/or depth camera for machine vision and/or gesture recognition; a head tracker, eye tracker, accelerometer, and/or gyroscope for motion detection and/or intent recognition.

When included, communication subsystem 1012 may be configured to communicatively couple computing system 1300 with one or more other computing devices. Communication subsystem 1012 may include wired and/or wireless communication devices compatible with one or more different communication protocols. The communication subsystem may be configured for communication via personal-, local- and/or wide-area networks.

In an example, a method for facilitating a remote conference comprises: receiving a digital video from a first remote computing device of a plurality of remote computing devices; receiving a first computer-readable audio signal from the first remote computing device; receiving a second computer-readable audio signal from a second remote computing device; operating a face identification machine to recognize a face of a first remote conference participant in the digital video; operating a speech recognition machine to 1) translate the first computer-readable audio signal to a first text, and 2) translate the second computer-readable audio signal to a second text; operating an attribution machine configured to 1) attribute the first text to the first remote conference participant recognized by the face identification machine, and 2) attribute the second text to a second remote conference participant; and operating a transcription machine configured to automatically create a transcript of the conference, the transcript including 1) the first text attributed to the first remote conference participant, and 2) the second text attributed to the second remote conference participant. In this example or any other example, the face identification machine is further configured to recognize, for each remote conference participant of a plurality of remote conference participants in the digital video, a face of the remote conference participant; the attribution machine is further configured, for each remote conference participant of the plurality of remote conference participants, to attribute a portion of the first text to the remote conference participant; and the transcript includes, for each remote conference participant of the plurality of remote conference participants, the portion of the text attributed to the remote conference participant. 
In this example or any other example, the transcript further includes an arrival time indicating a time of arrival of the first remote conference participant and a departure time indicating a time of departure of the first remote conference participant. In this example or any other example, the arrival time is determined based on a time of recognition of the first remote conference participant by the face identification machine. In this example or any other example, the transcription machine is configured to: recognize content of interest for the first remote conference participant; automatically recognize the content of interest in the transcript; and include within the transcript an indication of a portion of the transcript related to the content of interest. In this example or any other example, the transcription machine is configured, responsive to recognizing the content of interest in the transcript, to send a notification to a companion device of the first remote conference participant including the indication of the portion of the transcript related to the content of interest. In this example or any other example, the transcription machine is further configured to receive, from a companion device of the first remote conference participant, an indication of a digital file to be shared with the second remote conference participant, wherein the transcript further includes an indication that the digital file was shared. In this example or any other example, the transcription machine is further configured to recognize a portion of the digital file being accessed by one or more of the first remote conference participant and the second remote conference participant, and wherein the transcript further includes an indication of the portion of the digital file that was accessed and a time at which the portion of the file was accessed. 
In this example or any other example, the transcription machine is further configured to recognize, in the digital video, visual information being shared by the first remote conference participant, and wherein the transcript further includes a digital image representing the visual information. In this example or any other example, the transcription machine is further configured to recognize a change to the visual information, and the transcript further includes a difference image showing the change to the visual information and an indication of a time at which the visual information was changed. In this example or any other example, the transcription machine is further configured to recognize an occlusion of the visual information and to process one or more difference images to create a processed image showing the visual information with the occlusion removed; and wherein the transcript further includes the processed image. In this example or any other example, the method further comprises visually presenting a reviewable transcript at a companion device of a remote conference participant, wherein the reviewable transcript includes the difference image showing the change to the visual information and wherein the reviewable transcript is configured, responsive to selection of the difference image, to navigate to a portion of the transcript corresponding to the time at which the visual information was changed. In this example or any other example, the transcription machine is configured to transcribe speech of a first conference participant in real time, the method further comprising presenting a notification at a companion device of a second conference participant that the first conference participant is currently speaking and including transcribed speech of the first conference participant. 
In this example or any other example, the transcription machine is further configured to analyze the transcript to detect words having a predefined sentiment, the method further comprising presenting a sentiment analysis summary at a companion device of a conference participant, the sentiment analysis summary indicating a frequency of utterance of words having the predefined sentiment. In this example or any other example, the method further comprises a gesture recognition machine configured to recognize a gesture by the first remote conference participant indicating an event of interest, and wherein the transcription machine is configured to include an indication that the event of interest occurred responsive to detection of the gesture by the gesture recognition machine.

In an example, a method for facilitating participation in a conference by a client device comprises: receiving a digital video captured by a camera; receiving a computer-readable audio signal captured by a microphone; operating a face identification machine to recognize a face of a local conference participant in the digital video; operating a speech recognition machine to translate the computer-readable audio signal to text; operating an attribution machine to attribute the text to the local conference participant recognized by the face identification machine; sending, to a conference server device, the text attributed to the local conference participant; receiving, from the conference server device, a running transcript of the conference including the text attributed to the local conference participant, and further including different text attributed to a remote conference participant; and displaying, in real time, new text added to the running transcript and attribution for the new text.

In an example, a computerized conference assistant comprises: a camera configured to convert light of one or more electromagnetic bands into digital video; a face identification machine configured to 1) recognize a first face of a first local conference participant in the digital video, and 2) recognize a second face of a second local conference participant in the digital video; a microphone array configured to convert sound into a computer-readable audio signal; a speech recognition machine configured to translate the computer-readable audio signal to text; an attribution machine configured to 1) attribute a first portion of the text to the first local conference participant recognized by the face identification machine, and 2) attribute a second portion of the text to the second local conference participant recognized by the face identification machine; and a transcription machine configured to automatically create a transcript of the conference, the transcript including 1) the first text attributed to the first local conference participant, and 2) the second text attributed to the second local conference participant. In this example or any other example, the computerized conference assistant further comprises a communication subsystem configured to receive a second text attributed to a remote conference participant, wherein the transcription machine is configured to add, to the transcript, the second text attributed to the remote conference participant. In this example or any other example, the transcription machine is further configured to recognize, in the digital video, visual information being shared by a local conference participant, and wherein the transcript further includes a digital image representing the visual information. 
In this example or any other example, the computerized conference assistant further comprises a gesture recognition machine configured to recognize a hand gesture by a local conference participant requesting that recording be stopped, wherein the transcription machine is configured to stop creating the transcript responsive to recognition of the hand gesture by the gesture recognition machine.

It will be understood that the configurations and/or approaches described herein are exemplary in nature, and that these specific embodiments or examples are not to be considered in a limiting sense, because numerous variations are possible. The specific routines or methods described herein may represent one or more of any number of processing strategies. As such, various acts illustrated and/or described may be performed in the sequence illustrated and/or described, in other sequences, in parallel, or omitted. Likewise, the order of the above-described processes may be changed.

The subject matter of the present disclosure includes all novel and non-obvious combinations and sub-combinations of the various processes, systems and configurations, and other features, functions, acts, and/or properties disclosed herein, as well as any and all equivalents thereof.

1. At least one computer-storage device embodying computer-usable instructions which, when executed by at least one processor, implement a method comprising: obtaining information identifying a participant of a meeting; generating a transcript of the meeting by at least converting audio of a conversation by the participant to text; tracking an event of interest to the participant occurring during the meeting by at least generating a reviewable transcript based on the transcript and the event of interest; and providing a notification of the reviewable transcript to the participant.
2. The at least one computer-storage device of claim 1, wherein the event of interest is recorded in the reviewable transcript and includes at least one of: arrivals of participants, departures of participants, video captured during the meeting, visual information shared by participants, digital information shared by participants, data generated by a companion device, gestures performed by participant, and interactions by the participants.
3. The at least one computer-storage device of claim 1, wherein the reviewable transcript includes information correlating the events of interest with the transcript based on timestamps included in the transcript.
4. The at least one computer-storage device of claim 1, wherein the notification of the reviewable transcript to the participant is provided to the participant during the meeting.
5. The at least one computer-storage device of claim 1, wherein the notification is provided to a companion device associated with the participant.
6. The at least one computer-storage device of claim 1, wherein the reviewable transcript includes a digital image or a video captured during the meeting corresponding to the event of interest.
7. The at least one computer-storage device of claim 1, wherein the reviewable transcript includes a set of transcript entries indicating information associated with the event of interest, at least one transcript entry indicating an arrival of the participant to the meeting.
8. The at least one computer-storage device of claim 7, comprising determining the arrival of the participant based on at least one of: a signature associated with the participant; a set of users invited to the meeting, audio capture during the meeting, video captured during the meeting, and natural language features of the transcript.
9. A system comprising: at least one processor; at least one storage device storing computer-usable instructions which, when executed by the at least one processor, implement operations comprising: determining a meeting including a participant has started; identifying the participant; generating a reviewable transcript including information relevant to the participant by at least: converting audio of the meeting to text; and detecting an event of the meeting; and providing a notification of the reviewable transcript to the participant.
10. The system of claim 9, wherein the event includes an arrival of a second participant to the meeting.
11. The system of claim 10, the operations further comprising detecting the arrival of the second participant, wherein generating the reviewable transcript including the information relevant to the participant is performed in response to detecting the arrival of the second participant.
12. The system of claim 9, wherein the information relevant to the participant includes information recorded during an interval of time during the meeting when the first participant was absent based on an arrival of the participant to the meeting.
13. The system of claim 9, wherein identifying the participant further comprises identifying the participant based on the participant being at least one of: an invited participant, a colleague of the invited participant, and an individual with access to location the meeting is being held.
14. The system of claim 9, wherein identifying the participant further comprises prompting the participant to provide further identifying information by at least providing a question to a companion device associated with the participant.
15. The system of claim 9, the operations further comprising collecting, in response to identifying the participant, audio information or video information associated with the participant to update a signature associated with the participant, where the signature is used to identify the participant.
16. A method comprising: determining a participant of a meeting; detecting an event occurring during the meeting relevant to the participant; generating a reviewable transcript for the participant including text converted from audio of the meeting and the event; and providing the reviewable transcript to the participant.
17. The method of claim 16, wherein the method further comprises generating a second reviewable transcript for a second participant including a second event that is distinct from the event.
18. The method of claim 16, wherein the reviewable transcript further comprises a sentiment analysis associated with the participant and indicating sentiment-related words obtained from the text converted from the audio of the meeting.
19. The method of claim 16, wherein the reviewable transcript further comprises a participation summary associated with the participant indicating the event is attributable to the participant.
20. The method of claim 16, wherein an event of the subset of events includes a task assigned to the participant during the meeting.